# CUDA OVERVIEW
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.

In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords.

The CUDA Toolkit from NVIDIA provides everything you need to develop GPU-accelerated applications. The CUDA Toolkit includes GPU-accelerated libraries, a compiler, development tools and the CUDA runtime.

## Matrix Multiplication Tutorial

![mm](https://www.mathsisfun.com/algebra/images/matrix-multiply-order.gif)

Matrix multiplication is a great way to see the power of parallel processing using a GPU. Matrix multiplication is what we like to call _embarrassingly parallel_, which simply means it takes little-to-no effort to separate individual tasks: each element of the result can be computed independently of all the others.

Here is a quick tutorial: [www.mathsisfun.com](https://www.mathsisfun.com/algebra/matrix-multiplying.html)
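To make the _embarrassingly parallel_ idea concrete, here is a minimal sketch of how a single element of the product could be computed. The helper name `gemm_element` and the flat row-major storage are assumptions made for illustration; they are not taken from the tutorial's source files.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the tutorial sources): computes one element
// C[i][j] of C = A * B for square n x n matrices stored row-major in flat
// vectors. It only reads A and B, so every (i, j) pair can be evaluated
// independently of every other pair.
double gemm_element(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::size_t n, std::size_t i, std::size_t j) {
    assert(A.size() == n * n && B.size() == n * n);
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        sum += A[i * n + k] * B[k * n + j];  // row i of A times column j of B
    }
    return sum;
}
```

Because no call depends on the result of another, the $n^2$ calls needed to fill the output matrix could in principle all run at the same time, which is what the later sections exploit.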

## Hardware Check

First, let's confirm we have access to our GPU. We can do this using [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface).

The NVIDIA System Management Interface (nvidia-smi) is a command line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

This utility allows administrators to query GPU device state and, with the appropriate privileges, permits administrators to modify GPU device state. It is targeted at the Tesla™, GRID™, Quadro™ and Titan X products, though limited support is also available on other NVIDIA GPUs.

In [None]:
!nvidia-smi

## Software Check

Second, let's confirm we have access to our software stack.

We will be using the CUDA compiler driver, [NVCC](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html), and the [PGI Compiler](https://www.pgroup.com/products/community.htm).

It is the purpose of nvcc, the CUDA compiler driver, to hide the intricate details of CUDA compilation from developers. It accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.

We will be using the PGI tool set for OpenMP and OpenACC. Note that it is possible to do this entire tutorial with the PGI compiler alone, but we'll use both for tutorial purposes.

In [None]:
!nvcc -V

In [None]:
!pgc++ -V

## Normal (Serial - CPU)
Let's begin this tutorial by baselining the serial version of general matrix multiplication, [GEMM](https://spatial-lang.org/gemm). Start by studying [normal_C.cpp](/edit/normal_C.cpp), which contains the code. You will notice immediately that there is a set of nested for-loops. This is a telltale sign of an area of your code that has the **potential** to be distributed to a GPU.

Make the following changes to see how the execution time changes.

1. Change _N_ to something smaller and larger than 1024. Keep in mind that the algorithm's computational complexity is $O(n^3)$, so _N_ > 1024 can take many seconds to complete.
2. Change the optimization flag between -O0, -O1, -O2, and -O3 to see how compiler optimization affects the execution time.
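For readers without the source file at hand, the nested-loop structure and the $O(n^3)$ cost can be sketched as follows. This is an illustrative sketch, not the contents of normal_C.cpp; the names `gemm_serial` and `time_gemm_seconds`, the `double` element type, and the fill values are all assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical serial GEMM: C = A * B for square n x n row-major matrices.
// Three nested loops of length n give n^3 multiply-add operations.
void gemm_serial(const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

// Hypothetical timing helper: wall-clock seconds for one n x n multiplication.
double time_gemm_seconds(std::size_t n) {
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
    auto start = std::chrono::steady_clock::now();
    gemm_serial(A, B, C, n);
    auto stop = std::chrono::steady_clock::now();
    // Read one result through a volatile so an optimizing compiler
    // cannot discard the multiplication as unused work.
    volatile double sink = C[0];
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Doubling $n$ multiplies the multiply-add count by 8, so comparing `time_gemm_seconds(256)` with `time_gemm_seconds(512)` should show a roughly eightfold increase, although actual timings depend on the machine, cache behavior, and the optimization flag used.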

In [None]:
!nvcc -O2 normal_C.cpp -o normal_C

In [None]:
!./normal_C 1024

## OpenMP
Now, let's jump into [compiler directives](https://en.wikipedia.org/wiki/Directive_(programming)). They are one of the easiest forms of optimization, and probably the most common directive set for a CPU is [OpenMP](https://www.openmp.org/resources/). With directives, you give the compiler hints at compile time about where you think further optimizations can be made. Start by studying [openmp.cpp](/edit/openmp.cpp), which contains the code. Notice the **#pragma** statement at line 73. The pragma is a directive telling the compiler *look here*.

In this example, we are giving the compiler the following hint:
`#pragma omp parallel for shared(A, B, C, n) private(i, j, k) schedule(static)`

We are telling the compiler which variables are shared among all the threads and which variables are private to each thread. The `schedule` clause tells the compiler how the loop iterations are divided among the threads; with `static`, each thread is assigned its iterations up front.
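The directive can be seen in context in the sketch below. It is a minimal illustration, not the contents of openmp.cpp: the function name `gemm_openmp`, the raw-pointer interface, and the `double` element type are assumptions. Built without OpenMP support, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>

// Hypothetical OpenMP GEMM: C = A * B for square n x n row-major matrices.
// The pragma splits the iterations of the outer i-loop across threads.
void gemm_openmp(const double* A, const double* B, double* C, int n) {
    assert(A != nullptr && B != nullptr && C != nullptr && n >= 0);
    int i, j, k;
    // A, B, C and n are read (or written at distinct locations) by all
    // threads, so they are shared; each thread needs its own loop counters,
    // so i, j and k are private.
    #pragma omp parallel for shared(A, B, C, n) private(i, j, k) schedule(static)
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            double sum = 0.0;  // declared inside the loop, so private per thread
            for (k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;  // each (i, j) is written by exactly one thread
        }
    }
}
```

No locking is needed here because every thread writes to a different set of rows of `C`.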

With the current code, we need to set the system environment variable OMP_NUM_THREADS, with `export OMP_NUM_THREADS=X`, where X is the number of CPU threads available to use. Notice that in a Jupyter notebook we use the `%env` magic command instead of `export`.

This system environment variable can be overridden at runtime by calling `omp_set_num_threads(X)`, which is declared in the *omp.h* header file.
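A minimal sketch of the runtime override is shown below. The helper name `threads_used` is hypothetical, and the `_OPENMP` guard is an assumption added so the example still compiles when OpenMP is disabled (for instance, when `-mp` is left off the build command).

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: requests `requested` threads at runtime and returns
// the team size actually used by the next parallel region.
// Returns 1 when the program is built without OpenMP.
int threads_used(int requested) {
    assert(requested > 0);
    int team_size = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);  // takes precedence over OMP_NUM_THREADS
    #pragma omp parallel
    {
        // Only one thread records the team size; the others wait at the
        // implicit barrier at the end of the single construct.
        #pragma omp single
        team_size = omp_get_num_threads();
    }
#endif
    return team_size;
}
```

The OpenMP runtime is allowed to provide fewer threads than requested, so the returned value is best treated as an upper-bounded report rather than a guarantee.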

Make the following changes to see how the execution time changes.

1. Change the number of threads available through the system environment variable.
2. Uncomment lines 25 and 66, then change the argument passed to `omp_set_num_threads(X)`.
3. Change the size of the matrix.

In [None]:
!nvcc -ccbin pgc++ -O2 -Xcompiler "-V19.4 -mp" openmp.cpp -o openmp

In [None]:
%env OMP_NUM_THREADS=4

In [None]:
!./openmp 1024

Try:

1. Removing `-mp` from the build command.
2. Changing `omp_set_num_threads( 6 )` to 2, 4, or more.
3. Changing the input parameter, e.g. `./openmp 128`. Values higher than the default, 1024, will take longer.

## OpenACC
[openacc.cpp](/edit/openacc.cpp)

In [None]:
!nvcc -ccbin pgc++ -O2 -Xcompiler "-V19.4 -Bstatic_pgi -acc -ta=tesla:nordc -Minfo=accel -ta=time" openacc.cpp -o openacc

Why the error? If you are running the newest driver, 410.67, profiling requires elevated permissions.

In [None]:
!./openacc

## BLAS
[blas.cpp](/edit/blas.cpp)

In [None]:
!nvcc --run -ccbin pgc++ -O2 -Xlinker "-lblas" -Xcompiler "-V19.4" blas.cpp -o blas

In [None]:
!./blas 128

## cuBLAS
[cublas.cpp](/edit/cublas.cpp)

In [None]:
!nvcc --run -O2 -lcublas cublas.cpp -o cublas

In [None]:
!./cublas 128

## CUDA
[cuda.cu](/edit/cuda.cu)

In [None]:
!nvcc --run -O2 cuda.cu -o cuda

In [None]:
!./cuda 128