# CUDA OVERVIEW
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.

In GPU-accelerated applications, the sequential part of the workload runs on the CPU, which is optimized for single-threaded performance, while the compute-intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords.

The CUDA Toolkit from NVIDIA provides everything you need to develop GPU-accelerated applications. The CUDA Toolkit includes GPU-accelerated libraries, a compiler, development tools and the CUDA runtime.

The following notebook is only meant to expose the reader to the various ways to accelerate an application using GPUs. For more in-depth training, please visit the [NVIDIA Deep Learning Institute](https://www.nvidia.com/en-us/deep-learning-ai/education/). There you can access a wealth of resources on Deep Learning and Accelerated Computing courses. And don't forget to sign up for an [NVIDIA Developer Account](https://developer.nvidia.com/developer-program) to get access to all the latest information and toolsets!

To set up your system to fully utilize this notebook, please follow the instructions at [README.md](https://github.com/mnicely/computeWorks_examples/tree/master/computeWorks_mm).

## Matrix Multiplication Tutorial

![mm](https://www.mathsisfun.com/algebra/images/matrix-multiply-order.gif)

Matrix multiplication is a great way to see the power of parallel processing using a GPU. Matrix multiplication is what we like to call _embarrassingly parallel_, which simply means it takes little-to-no effort to separate it into individual tasks.

Here is a quick tutorial: [www.mathsisfun.com](https://www.mathsisfun.com/algebra/matrix-multiplying.html)
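
For reference, the operation can also be written out explicitly. Assuming square $N \times N$ matrices $A$, $B$ and $C$ with zero-based indices (symbols chosen here for illustration), each element of the product is the dot product of a row of $A$ with a column of $B$:

$$C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj}, \qquad 0 \le i, j < N$$

Each of the $N^2$ output elements depends only on the inputs and never on another output element, which is why they can all be computed independently.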

## Hardware Check

First, let's confirm we have access to our GPU. We can do this using [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface).

The NVIDIA System Management Interface (nvidia-smi) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices. 

This utility allows administrators to query GPU device state and, with the appropriate privileges, permits administrators to modify GPU device state. It is targeted at the Tesla™, GRID™, Quadro™ and Titan X products, though limited support is also available on other NVIDIA GPUs.

To run a block of code below, click on it to select it, then either click the run button (the button with a triangle) in the menu above or press Ctrl+Enter:

In [None]:
!nvidia-smi

## Software Check

Second, let's confirm we have access to our software stack.

We will be using the CUDA compiler driver, [NVCC](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html), and the [PGI Compiler](https://www.pgroup.com/products/community.htm).

It is the purpose of nvcc, the CUDA compiler driver, to hide the intricate details of CUDA compilation from developers. It accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.

We will be using the PGI tool set for OpenMP and OpenACC. Note that it is possible to do this entire tutorial with the PGI compiler, but we'll use both for tutorial purposes.

In [None]:
!nvcc -V

In [None]:
!pgc++ -V

## Normal (Serial - CPU)
Let's begin this tutorial by baselining the serial version of general matrix multiplication, [GEMM](https://spatial-lang.org/gemm). Start by studying [normal_C.cpp](/edit/normal_C.cpp), which contains the code. You will notice in the *normalC*() function that there is a set of nested for-loops. This is a telltale sign of areas of your code that have the **potential** to be distributed to a GPU.

Make the following changes to see how the execution time changes.

1. Change _N_ to something smaller and larger than 1024. Keep in mind that the algorithm's computational complexity is $O(n^3)$, so _N_ > 1024 can take many seconds to complete.
2. Change the optimization flag between -O0, -O1, -O2, and -O3 to see how compiler optimization affects the execution time.
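
As a point of reference, the nested loops can be sketched as below. This is a minimal illustration and not the contents of normal_C.cpp: the names `naive_gemm` and `time_naive_gemm_ms` are hypothetical, the matrices are assumed to be square, row-major and stored in flat arrays, and the scaling factors of a full GEMM are left out.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical serial baseline: C = A * B for row-major n x n matrices.
// Illustrative sketch only; not the code in normal_C.cpp.
void naive_gemm(int n, const float* A, const float* B, float* C) {
    assert(n > 0);
    for (int i = 0; i < n; ++i) {          // row of C
        for (int j = 0; j < n; ++j) {      // column of C
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {  // dot product of row i of A and column j of B
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

// Times a single call to naive_gemm and returns the elapsed milliseconds.
double time_naive_gemm_ms(int n, const float* A, const float* B, float* C) {
    const auto start = std::chrono::steady_clock::now();
    naive_gemm(n, A, B, C);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The three loops each run _N_ times, which is where the cubic growth in execution time comes from.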

In [None]:
!nvcc -O2 normal_C.cpp -o normal_C

In [None]:
!./normal_C 1024

## OpenMP
Now, let's jump into [compiler directives](https://en.wikipedia.org/wiki/Directive_(programming)). They are one of the easiest optimization techniques, and probably the most common one for a CPU is [OpenMP](https://www.openmp.org/resources/). With directives, you give the compiler hints at compile time about where you think further optimizations can be made. Start by studying [openmp.cpp](/edit/openmp.cpp), which contains the code. Notice the **#pragma** statement at line 73. The pragma is a directive telling the compiler *look here*.

In this example, we are giving the compiler the following hint:
`#pragma omp parallel for shared(A, B, C, n) private(i, j, k) schedule(static)`

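We are telling the compiler which variables are shared among all threads and which variables are private to each thread. The `schedule` clause tells the compiler how loop iterations are distributed among the threads; with `static`, the iterations are divided into chunks that are assigned to threads up front.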

With the current code, we need to set the system environment variable OMP_NUM_THREADS with `export OMP_NUM_THREADS=X`, where X is the number of CPU threads to use. Notice that in a Jupyter notebook we use the `%env` magic command instead.

This system environment variable can be overridden at runtime by calling `omp_set_num_threads(X)`, which is declared in the *omp.h* header file.
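
A minimal sketch of how the directive and the runtime call fit together is shown below. It is not the contents of openmp.cpp: the function name `openmp_gemm` and its `num_threads` parameter are hypothetical, and the matrices are assumed to be square and row-major. The `_OPENMP` guard lets the same code build without OpenMP, in which case the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>  // declares omp_set_num_threads()
#endif

// Hypothetical OpenMP version of the triple loop; not the code in openmp.cpp.
// Passing num_threads <= 0 leaves the thread count to OMP_NUM_THREADS.
void openmp_gemm(int n, const float* A, const float* B, float* C, int num_threads) {
    assert(n > 0);
#ifdef _OPENMP
    if (num_threads > 0) {
        omp_set_num_threads(num_threads);  // overrides OMP_NUM_THREADS
    }
#else
    (void)num_threads;  // built without OpenMP: the pragma below is ignored
#endif
    int i, j, k;
    // Rows of C are split among the threads; each thread gets its own i, j, k.
#pragma omp parallel for shared(A, B, C, n) private(i, j, k) schedule(static)
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            float sum = 0.0f;  // declared inside the loop, so private to each thread
            for (k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

Because every thread writes to a different set of rows in `C`, no synchronization is needed inside the loop.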

Lastly, we need to pass some additional flags to the nvcc compiler. In this tutorial, we are using the OpenMP library from the PGI compiler, so we need to specify that we want the PGI C++ compiler for the host code. This is done with `-ccbin pgc++`. Now that we have chosen the host compiler, we use `-Xcompiler` to pass it two flags: `-V19.4` to select PGI version 19.4 and `-mp` to enable OpenMP and link its OpenMP library.

Make the following changes to see how the execution time changes.

1. Change the number of threads available through the system environment variable.
2. Change the optimization flag between -O0, -O1, -O2, and -O3.
3. Uncomment lines 25 and 66, then change the number of threads in `omp_set_num_threads(X)`.
4. Set the matrix to something smaller and larger than 1024.
5. Remove -mp from the build command.

In [63]:
!nvcc -ccbin pgc++ -O2 -Xcompiler "-V19.4 -mp" openmp.cpp -o openmp

In [None]:
%env OMP_NUM_THREADS=1

In [None]:
!./openmp 1024

## OpenACC
As we just saw, utilizing parallel resources can greatly improve the execution time of an application. Using OpenMP, we were able to gain access to multiple cores on the CPU. But how do we access the hundreds, and even thousands, of cores on a GPU using compiler directives? That's where [OpenACC](https://developer.nvidia.com/openacc) comes in! This is great for legacy code!

Just like OpenMP, OpenACC allows you to create highly optimized code with little programming effort and little low-level knowledge of the GPU. Start by studying [openacc.cpp](/edit/openacc.cpp), which contains the code. In the *openACC*() function, you will see **#pragma** statements just like we had in OpenMP. With OpenACC we have finer granularity when assigning our code to the GPU, so we need to give the compiler a few more hints. These additional hints tell the compiler where any dependencies are in our code. The compiler will try its best if these hints aren't there.

To use OpenACC, we need to pass the `-acc` flag to the compiler, just like we passed `-mp` to use OpenMP. We can pass some more flags for increased functionality. First, we let the compiler know that we want to target an NVIDIA GPU with `-ta=tesla:nordc`. When we add `-Minfo=accel`, the compiler displays its optimization analysis during compilation. It does this every time it sees `#pragma acc kernels`, which is a hint to the compiler to analyze that region of code for optimizations.
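
To make the directives concrete, here is a minimal sketch of a `kernels` region around the triple loop. It is not the contents of openacc.cpp: the function name `openacc_gemm` is hypothetical, the matrices are assumed to be square and row-major, and the particular choice of clauses is only one reasonable option. When built without `-acc`, the pragmas are ignored and the loops run serially on the CPU.

```cpp
#include <cassert>

// Hypothetical OpenACC version of the triple loop; not the code in openacc.cpp.
void openacc_gemm(int n, const float* A, const float* B, float* C) {
    assert(n > 0);
    // Data clauses: copy A and B to the GPU, copy C back to the host afterwards.
#pragma acc kernels copyin(A[0:n*n], B[0:n*n]) copyout(C[0:n*n])
    {
        // "independent" is the extra hint: iterations do not depend on each other.
#pragma acc loop independent
        for (int i = 0; i < n; ++i) {
#pragma acc loop independent
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                // The dot product is kept sequential within each thread.
#pragma acc loop seq
                for (int k = 0; k < n; ++k) {
                    sum += A[i * n + k] * B[k * n + j];
                }
                C[i * n + j] = sum;
            }
        }
    }
}
```

The data clauses matter as much as the loop hints, since every byte copied between host and GPU adds to the total run time.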

What if we want to profile the code for performance data? Your initial thought may be to wrap the OpenACC function with timers, like we did with the serial code. While this method will certainly give you a general execution time, it leaves a lot of room for error, and users are advised not to use it. Instead, we can pass the `-ta=time` flag to the compiler without writing any extra code. This flag also gives us a breakdown of compute time and transfer times to and from the GPU. The flag `-Bstatic_pgi` links the required libraries for profiling.

Notice that we don't need to set any system environment variables.

Please visit []() for an in-depth tutorial on OpenACC!

Make the following changes to see how the execution time changes.

1. Change the optimization flag between -O0, -O1, -O2, and -O3.
2. Set the matrix to something smaller and larger than 1024.
3. Remove -acc from the build command.
4. Remove accel from the -Minfo flag in the build command.
5. Remove -ta=time from the build command.

In [64]:
!nvcc -ccbin pgc++ -O2 -Xcompiler "-V19.4 -acc -ta=tesla:nordc -Minfo=accel -Bstatic_pgi -ta=time" openacc.cpp -o openacc

openACC(int, float, const float *, const float *, float, float *, const int &):
     60, Generating copyin(A[:n*n])
         Generating copyout(C[:n*n])
         Generating copyin(B[:n*n])
     64, Loop is parallelizable
     66, Loop is parallelizable
         Generating Tesla code
         64, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         66, #pragma acc loop gang /* blockIdx.y */
         69, #pragma acc loop seq
     69, Loop is parallelizable


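The feedback above comes from `-Minfo=accel`. At line 60 the compiler generates the data movement: `A` and `B` are copied to the GPU (`copyin`) and `C` is copied back to the host (`copyout`). The loops at lines 64 and 66 are parallelized and mapped onto the GPU: the line 64 loop is spread across gangs and 128-wide vectors (`blockIdx.x` and `threadIdx.x` in CUDA terms), and the line 66 loop across gangs (`blockIdx.y`). The innermost loop at line 69 is also reported as parallelizable, but the compiler runs it sequentially (`seq`) within each thread.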

In [65]:
!./openacc 1024

Running with N = 1024

Running Normal C: 1719.26 ms
Running OpenACC: Test passed.

Accelerator Kernel Timing data
/home/mnicely/git/computeWorks_examples/computeWorks_mm/jupyter/openacc.cpp
  _Z7openACCifPKfS0_fPfRKi  NVIDIA  devicenum=0
    time(us): 26,168
    60: compute region reached 5 times
        66: kernel launched 5 times
            grid: [8x1024]  block: [128]
             device time(us): total=21,101 max=4,221 min=4,219 avg=4,220
            elapsed time(us): total=21,286 max=4,271 min=4,247 avg=4,257
    60: data region reached 10 times
        60: data copyin transfers: 10
             device time(us): total=3,343 max=341 min=330 avg=334
        74: data copyout transfers: 5
             device time(us): total=1,724 max=357 min=335 avg=344


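The timing data above comes from `-ta=time`. The compute region at line 60 was reached 5 times, each time launching the kernel at line 66 on a grid of 8x1024 blocks with 128 threads per block. Each launch took about 4,220 µs on the device, for a total of 21,101 µs. Data transfers added 3,343 µs for the 10 copyin transfers, which matches `A` and `B` for each of the 5 runs, and 1,724 µs for the 5 copyout transfers of `C`. Together these account for the 26,168 µs reported at the top, so roughly 80% of the accelerator time is compute and 20% is data movement.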
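
One way to read the kernel timings in the profiling output is to convert them to throughput. The sketch below assumes the usual operation count for GEMM of $2N^3$ floating-point operations (one multiply and one add per inner-loop iteration); the helper name `gemm_gflops` is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: achieved GFLOP/s for an n x n GEMM that took `micros` microseconds.
// Assumes 2 * n^3 floating-point operations for the multiplication.
double gemm_gflops(int n, double micros) {
    assert(n > 0 && micros > 0.0);
    const double flops = 2.0 * n * n * n;  // evaluated in double, so no integer overflow
    // flops / (micros * 1e-6 s) gives FLOP/s; dividing by 1e9 gives GFLOP/s.
    return flops / (micros * 1.0e3);
}
```

With the average kernel device time of 4,220 µs reported for N = 1024, `gemm_gflops(1024, 4220.0)` works out to roughly 509 GFLOP/s.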

## BLAS
An alternative to compiler directives is to use a drop-in [library](). Libraries offer pre-optimized implementations of common routines. They are considered drop-in because they can be added with a simple include.

The [Basic Linear Algebra Subprograms]() (BLAS) library is probably one of the most highly optimized libraries in the world. It also includes a function specifically for GEMM calculations! Start by studying [blas.cpp](/edit/blas.cpp), which contains the code. You will notice in the blas function that there are no longer any for loops. That is because they are *hidden* in the BLAS library. All we have to do is call the *blas_sgemm*() function and pass the appropriate parameters.
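
The parameters of an SGEMM call are easier to read once the operation itself is spelled out. SGEMM computes $C \leftarrow \alpha A B + \beta C$ in single precision. The sketch below is a plain C++ stand-in with a hypothetical name, `reference_sgemm`, assuming row-major storage and no transposition. It is not the BLAS routine and not the code in blas.cpp, and an optimized library will be far faster.

```cpp
#include <cassert>

// Hypothetical stand-in that spells out what SGEMM computes: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all row-major.
// lda, ldb and ldc are the leading dimensions (row strides) of the three arrays.
void reference_sgemm(int m, int n, int k, float alpha,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float beta, float* C, int ldc) {
    assert(m > 0 && n > 0 && k > 0);
    assert(lda >= k && ldb >= n && ldc >= n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[i * lda + p] * B[p * ldb + j];
            }
            // Scale the product by alpha and blend in the previous C scaled by beta.
            C[i * ldc + j] = alpha * sum + beta * C[i * ldc + j];
        }
    }
}
```

Setting `alpha` to 1 and `beta` to 0 reduces this to a plain matrix product for the square case used in this tutorial, where `m`, `n`, `k` and all three leading dimensions equal _N_.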

In this example, we need to link the BLAS library with `-Xlinker "-lblas"`.

Make the following changes to see how the execution time changes.

1. Change the optimization flag between -O0, -O1, -O2, and -O3.
2. Set the matrix to something smaller and larger than 1024.

In [None]:
!nvcc -ccbin pgc++ -O2 -Xlinker "-lblas" -Xcompiler "-V19.4" blas.cpp -o blas

In [None]:
!./blas 1024

## cuBLAS
[cublas.cpp](/edit/cublas.cpp)

In [None]:
!nvcc -O2 -lcublas cublas.cpp -o cublas

In [None]:
!./cublas 1024

## CUDA
[cuda.cu](/edit/cuda.cu)

In [None]:
!nvcc -O2 cuda.cu -o cuda

In [None]:
!./cuda 1024