# Objective

- To become familiar with some valuable tools and resources from the CUDA Toolkit
  - Compiler flags
  - Debuggers
  - Profilers

<hr style="height:2px">

# GPU Programming Languages

![alt tag](img/3.png)
<hr style="height:2px">

# CUDA - C

![alt tag](img/4.png)
<hr style="height:2px">

# NVCC Compiler

- NVIDIA provides a CUDA-C compiler
  - nvcc
- nvcc compiles the device code itself, then forwards the host code to the host compiler (e.g. g++)
- Can also be used to compile & link host-only applications

<hr style="height:2px">

# Example 1: Hello World

```cpp

#include <stdio.h>  // declares printf

int main() {
  printf("Hello World!\n");
  return 0;
}

```

![alt tag](img/6.png)
<hr style="height:2px">

# CUDA Example 1: Hello World

```cpp

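#include <stdio.h>  // declares printf

// Empty kernel: runs on the device and does nothing
__global__ void mykernel(void) {}

int main(void) {
  mykernel<<<1,1>>>();  // launch 1 block of 1 thread
  printf("Hello World!\n");
  return 0;
}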

```

![alt tag](img/7.png)
<hr style="height:2px">

# CUDA Example 1: Build Considerations

- Build failed
  - nvcc only parses .cu files for CUDA syntax
- Fixes:
  - Rename main.cc to main.cu, or
  - Treat all input files as .cu files using: nvcc -x cu

![alt tag](img/8.png)
<hr style="height:2px">

# Hello World! with Device Code

```cpp

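#include <stdio.h>  // declares printf

// Empty kernel: runs on the device and does nothing
__global__ void mykernel(void) {}

int main(void) {
  mykernel<<<1,1>>>();  // launch 1 block of 1 thread
  printf("Hello World!\n");
  return 0;
}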

```
#### Output:

```shell

nvcc main.cu
./a.out
Hello World!

```

### mykernel() does nothing, somewhat anticlimactic!

<hr style="height:2px">

# Developer Tools - Debuggers

![alt tag](img/10.png)

##### https://developer.nvidia.com/debugging-solutions

<hr style="height:2px">

# Compiler Flags

- Remember there are two compilers being used
  - NVCC: Device code
  - Host Compiler: C/C++ code
- NVCC supports some host compiler flags
  - If a flag is unsupported, use -Xcompiler to forward it to the host compiler
    - e.g. -Xcompiler -fopenmp
- Debugging Flags
  - -g: Include host debugging symbols
  - -G: Include device debugging symbols (this also disables device code optimizations)
  - -lineinfo: Generate line-number information for device code

<hr style="height:2px">

# CUDA-MEMCHECK

- Memory debugging tool
  - No recompilation necessary: %> cuda-memcheck ./exe
  
  
- Can detect the following errors
  - Memory leaks
  - Memory errors (out-of-bounds or misaligned access, illegal instruction, etc.)
  - Race conditions
  - Illegal Barriers
  - Uninitialized Memory
  
  
- For line numbers use the following compiler flags:
  - -Xcompiler -rdynamic -lineinfo
  
##### http://docs.nvidia.com/cuda/cuda-memcheck
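
As a rough illustration of the kind of bug this tool is meant to catch, the sketch below launches more threads than there are array elements and omits the bounds check. The file name `oob.cu` and the kernel `fill` are hypothetical. Run directly, the program may well report no error at all; run under cuda-memcheck, the out-of-bounds writes should be flagged.

```cpp
// oob.cu (hypothetical file name)
// Build sketch: nvcc -Xcompiler -rdynamic -lineinfo oob.cu -o oob
// Run sketch:   cuda-memcheck ./oob
#include <stdio.h>

__global__ void fill(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Deliberate bug: the "if (i < n)" guard is missing, so threads
  // with i >= n write past the end of the allocation.
  data[i] = i;
}

int main(void) {
  const int n = 100;
  int *dev = NULL;
  cudaMalloc(&dev, n * sizeof(int));

  // 2 blocks of 64 threads = 128 threads for only 100 elements
  fill<<<2, 64>>>(dev, n);

  // Wait for the kernel and print whatever status the runtime reports
  cudaError_t err = cudaDeviceSynchronize();
  printf("status: %s\n", cudaGetErrorString(err));

  cudaFree(dev);
  return 0;
}
```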

<hr style="height:2px">

# CUDA-GDB

- cuda-gdb is an extension of GDB
  - Provides seamless debugging of CUDA and CPU code


- Works on Linux and Macintosh
  - For a Windows debugger use NSIGHT Visual Studio Edition

##### http://docs.nvidia.com/cuda/cuda-gdb
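
A short session might look like the sketch below. The file name `square.cu` and the kernel `square` are hypothetical, and the commands in the leading comment are only one plausible sequence: break at the kernel, inspect the CUDA threads, switch focus to one of them, and print a device-side variable.

```cpp
// square.cu (hypothetical file name)
// Build sketch: nvcc -g -G square.cu -o square
// Session sketch:
//   cuda-gdb ./square
//   (cuda-gdb) break square           stop at kernel entry
//   (cuda-gdb) run
//   (cuda-gdb) info cuda threads      list the device threads
//   (cuda-gdb) cuda block 1 thread 3  switch focus to one thread
//   (cuda-gdb) next                   step past the computation of i
//   (cuda-gdb) print i
//   (cuda-gdb) continue
#include <stdio.h>

__global__ void square(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = i * i;
}

int main(void) {
  const int n = 64;
  int host[n];
  int *dev = NULL;

  cudaMalloc(&dev, n * sizeof(int));
  square<<<2, 32>>>(dev, n);  // 2 blocks of 32 threads cover 64 elements
  cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dev);

  printf("host[7] = %d\n", host[7]);
  return 0;
}
```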

<hr style="height:2px">

# Example 3: cuda-gdb

![alt tag](img/15.png)
<hr style="height:2px">

# Developer Tools - Profilers

![alt tag](img/16.png)

##### https://developer.nvidia.com/performance-analysis-tools

<hr style="height:2px">

# NVPROF

Command-line profiler:

- Reports compute time in each kernel
- Reports memory transfer time
- Collects metrics and events
- Supports complex process hierarchies
- Collects profiles for the NVIDIA Visual Profiler
- No need to recompile
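
To make these points concrete, here is a small vector-add program that gives nvprof both kernel time and memory transfer time to report. The file name `vecadd.cu`, the kernel `vecAdd`, and the output file `profile.nvvp` are hypothetical; the commands in the leading comment are a sketch of typical invocations rather than a complete reference.

```cpp
// vecadd.cu (hypothetical file name)
// Build sketch: nvcc vecadd.cu -o vecadd
// Run sketches:
//   nvprof ./vecadd                               summary of kernels and memcpys
//   nvprof --print-gpu-trace ./vecadd             one line per kernel / transfer
//   nvprof --metrics achieved_occupancy ./vecadd  collect a metric
//   nvprof -o profile.nvvp ./vecadd               save a profile for NVVP
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);

  float *a = (float *)malloc(bytes);
  float *b = (float *)malloc(bytes);
  float *c = (float *)malloc(bytes);
  for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  float *da = NULL, *db = NULL, *dc = NULL;
  cudaMalloc(&da, bytes);
  cudaMalloc(&db, bytes);
  cudaMalloc(&dc, bytes);

  // Host-to-device transfers: reported as memory transfer time
  cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

  // Kernel launch: reported as kernel compute time
  vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

  // Device-to-host transfer
  cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
  printf("c[0] = %.1f\n", c[0]);

  cudaFree(da); cudaFree(db); cudaFree(dc);
  free(a); free(b); free(c);
  return 0;
}
```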

<hr style="height:2px">

# Example 4: nvprof

![alt tag](img/18.png)
<hr style="height:2px">

# NVIDIA’s Visual Profiler (NVVP)

![alt tag](img/19.png)
<hr style="height:2px">

# Example 4: NVVP

![alt tag](img/20.png)

![alt tag](img/21.png)

###### Note:
- If kernel order is non-deterministic, you can load either the timeline or the metrics, but not both.
- If you load just the metrics, the timeline looks odd, but the metrics are correct.

### Let’s now generate the same data within NVVP

![alt tag](img/22.png)

<hr style="height:2px">

# NVTX

- Our current tools only profile API calls on the host
  - What if we want to understand better what the host is doing?


- The NVTX library allows us to annotate profiles with ranges
  - Add: `#include <nvToolsExt.h>`
  - Link with: -lnvToolsExt


- Mark the start of a range
  - nvtxRangePushA("description");


- Mark the end of a range
  - nvtxRangePop();


- Push/pop ranges may be nested; ranges that need to overlap arbitrarily can use nvtxRangeStartA() / nvtxRangeEnd()

##### http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/
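
The push/pop calls above can be put together in a minimal sketch. The file name `nvtx_demo.cu`, the kernel `increment`, and the range descriptions are hypothetical; the example marks one host-only phase and one GPU phase with a nested inner range, so that both should appear as named ranges on the profiler timeline.

```cpp
// nvtx_demo.cu (hypothetical file name)
// Build sketch: nvcc nvtx_demo.cu -o nvtx_demo -lnvToolsExt
#include <stdio.h>
#include <stdlib.h>
#include <nvToolsExt.h>

__global__ void increment(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

int main(void) {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(int);
  int *host = (int *)malloc(bytes);

  // Host-only work: invisible to the profiler without an annotation
  nvtxRangePushA("init host data");
  for (int i = 0; i < n; ++i) host[i] = i;
  nvtxRangePop();

  // Outer range around all GPU-related work
  nvtxRangePushA("gpu work");
  int *dev = NULL;
  cudaMalloc(&dev, bytes);

  // Nested range: pushed and popped inside the outer one
  nvtxRangePushA("copy + kernel");
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
  increment<<<(n + 255) / 256, 256>>>(dev, n);
  cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
  nvtxRangePop();  // ends "copy + kernel"

  cudaFree(dev);
  nvtxRangePop();  // ends "gpu work"

  printf("host[0] = %d\n", host[0]);
  free(host);
  return 0;
}
```

Each nvtxRangePop() closes the most recently pushed range, which is why every push in the sketch is paired with exactly one pop.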
<hr style="height:2px">

# NVTX Profile

![alt tag](img/24.png)
<hr style="height:2px">

# NSIGHT

- CUDA-enabled Integrated Development Environment
  - Source code editor: syntax highlighting, code refactoring, etc.
  - Build Manager
  - Visual Debugger
  - Visual Profiler


- Linux/Macintosh
  - Editor = Eclipse
  - Debugger = cuda-gdb with a visual wrapper
  - Profiler = NVVP


- Windows
  - Integrates directly into Visual Studio
  - Profiler is NSIGHT VSE

<hr style="height:2px">

# Example 4: NSIGHT

![alt tag](img/26.png)
<hr style="height:2px">

# Profiler Summary

- Many profiling tools are available


- NVIDIA Provided
  - NVPROF: Command Line
  - NVVP: Visual profiler
  - NSIGHT: IDE (Visual Studio and Eclipse)
  

- 3rd Party
  - TAU
  - VAMPIR
  

<hr style="height:2px">

# Optimization

![alt tag](img/28.png)
<hr style="height:2px">

# Assess

![alt tag](img/29.png)

- Profile the code, find the hotspot(s)
- Focus your attention where it will give the most benefit

<hr style="height:2px">

# Parallelize

![alt tag](img/30.png)
<hr style="height:2px">

# Optimize

![alt tag](img/31.png)
<hr style="height:2px">

# Bottleneck Analysis

![alt tag](img/32.png)
<hr style="height:2px">

# Performance Analysis

![alt tag](img/33.png)
<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>