# CUDA OVERVIEW
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.

In GPU-accelerated applications, the sequential part of the workload runs on the CPU (which is optimized for single-threaded performance), while the compute-intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords.

The CUDA Toolkit from NVIDIA provides everything you need to develop GPU-accelerated applications. The CUDA Toolkit includes GPU-accelerated libraries, a compiler, development tools and the CUDA runtime.

## Matrix Multiplication Tutorial

![mm](https://www.mathsisfun.com/algebra/images/matrix-multiply-order.gif)

For a quick refresher on matrix multiplication, see the tutorial at [www.mathsisfun.com](https://www.mathsisfun.com/algebra/matrix-multiplying.html).
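For reference, the operation used throughout this notebook can be written compactly. Assuming square $N \times N$ matrices with zero-based indices (a notational choice made here, not one taken from the linked tutorial), each element of the product is the dot product of a row of $A$ with a column of $B$:

```latex
C_{ij} = \sum_{k=0}^{N-1} A_{ik} \, B_{kj}, \qquad 0 \le i, j < N
```

Computing all $N^2$ elements this way takes $N^3$ multiply-add pairs, which is conventionally counted as about $2N^3$ floating-point operations.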

## Hardware Check

First, let's confirm we have access to our GPU. We can do this using [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface).

The NVIDIA System Management Interface (nvidia-smi) is a command line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

This utility allows administrators to query GPU device state and, with the appropriate privileges, to modify GPU device state. It is targeted at the Tesla™, GRID™, Quadro™ and Titan X products, though limited support is also available on other NVIDIA GPUs.

In [None]:
!nvidia-smi

## Software Check

Second, let's confirm we have access to our software stack.

We will be using the CUDA compiler driver, [NVCC](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html), and the [PGI Compiler](https://www.pgroup.com/products/community.htm).

It is the purpose of nvcc, the CUDA compiler driver, to hide the intricate details of CUDA compilation from developers. It accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.

We will be using the PGI tool set for OpenMP and OpenACC. Note that it is possible to do this entire tutorial with the PGI compiler alone, but we'll use both for tutorial purposes.

In [None]:
!nvcc -V

In [None]:
!pgc++ -V

## Normal (Serial - CPU)
[normal_C.cpp](/edit/normal_C.cpp)

In [4]:
!nvcc --run normal_C.cpp -O2 -o normal_C

Running with N = 1024

Running Normal C: 1771.51 ms


Try:
- Changing `-O2` to `-O0` or `-O3`
- Changing the input parameter, e.g. `./normal_C 128`. Values higher than the default, 1024, will take longer.
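The source of `normal_C.cpp` is not reproduced in this notebook, so as a rough guide, here is a minimal sketch of what a serial matrix multiplication typically looks like. The function name `matmulSerial` and the row-major flat-array layout are assumptions made for illustration, not taken from the file.

```cpp
// Hypothetical sketch; the actual code in normal_C.cpp may differ.
// Computes C = A * B for n x n matrices stored row-major in flat arrays,
// i.e. element (i, j) of a matrix X lives at X[i * n + j].
void matmulSerial(int n, const float* A, const float* B, float* C) {
    for (int i = 0; i < n; ++i) {          // row of C
        for (int j = 0; j < n; ++j) {      // column of C
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {  // dot product of row i and column j
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

For N = 1024 the innermost statement runs 1024³ (about 1.07 billion) times, which is why the optimization level tends to have a visible effect on the reported time.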

## OpenMP
[openmp.cpp](/edit/openmp.cpp)

In [7]:
!nvcc --run -ccbin pgc++ -O2 -Xcompiler "-V19.4 -mp" openmp.cpp -o openmp

Running with N = 1024

Running Normal C: 1748.31 ms
Running OpenMP: 562.54 ms
Test passed.


Try:
- Removing `-mp` from the build command
- Changing `omp_set_num_threads( 6 )` to 2, 4, or more
- Changing the input parameter, e.g. `./openmp 128`. Values higher than the default, 1024, will take longer.
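To make the suggestions above concrete, here is a hedged sketch of how an OpenMP version of the multiplication might be written. The function name `matmulOpenMP` and its `numThreads` parameter are hypothetical; only the call to `omp_set_num_threads` is known to appear in `openmp.cpp`. The `_OPENMP` guard lets the same file build when OpenMP is disabled, in which case the pragma is ignored and the loop runs serially, which is roughly what removing `-mp` does.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical sketch; the actual code in openmp.cpp may differ.
// Rows of C are independent, so the outer loop can be split across threads.
void matmulOpenMP(int n, const float* A, const float* B, float* C, int numThreads) {
#ifdef _OPENMP
    omp_set_num_threads(numThreads);  // e.g. 6 in the tutorial, or 2, 4, ...
#else
    (void)numThreads;                 // OpenMP disabled: runs on one thread
#endif

#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;         // private to each iteration, so no race
            for (int k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

With the PGI compiler OpenMP is enabled by `-mp`; with GCC the equivalent flag is `-fopenmp`.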

## OpenACC
[openacc.cpp](/edit/openacc.cpp)

In [16]:
!nvcc -ccbin pgc++ -O2 -Xcompiler "-V19.4 -Bstatic_pgi -acc -ta=tesla:nordc -Minfo=accel -ta=time" openacc.cpp -o openacc

openACC(int, float, const float *, const float *, float, float *, const int &):
     60, Generating copyin(A[:n*n])
         Generating copyout(C[:n*n])
         Generating copyin(B[:n*n])
     64, Loop is parallelizable
     66, Loop is parallelizable
         Generating Tesla code
         64, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         66, #pragma acc loop gang /* blockIdx.y */
         69, #pragma acc loop seq
     69, Loop is parallelizable


Why might you see an error? If you are running the newest driver, 410.67, profiling (enabled here by `-ta=time`) requires elevated permissions.
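The `-Minfo=accel` feedback from the build reports `copyin` clauses for `A` and `B`, a `copyout` clause for `C`, two parallelized outer loops and a sequential inner loop. A kernel along the following lines would be consistent with that feedback. This is only a sketch: the function name `matmulOpenACC`, the simplified argument list, and the exact pragmas are assumptions, since `openacc.cpp` is not shown.

```cpp
// Hypothetical sketch; the actual code in openacc.cpp may differ.
// A compiler without OpenACC support ignores the pragmas and runs this serially.
void matmulOpenACC(int n, const float* A, const float* B, float* C) {
    // Copy A and B to the device before the region, copy C back after it.
#pragma acc kernels copyin(A[0:n*n], B[0:n*n]) copyout(C[0:n*n])
    {
#pragma acc loop independent
        for (int i = 0; i < n; ++i) {
#pragma acc loop independent
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
#pragma acc loop seq
                for (int k = 0; k < n; ++k) {  // accumulation stays sequential
                    sum += A[i * n + k] * B[k * n + j];
                }
                C[i * n + j] = sum;
            }
        }
    }
}
```

Because `C` is only copied out, any values it held before the region are not visible on the device; that is fine here since every element is overwritten.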

In [17]:
!./openacc

Running with N = 1024

Running Normal C: 1764.94 ms
Running OpenACC: Test passed.

Accelerator Kernel Timing data
/home/mnicely/git/computeWorks_examples/computeWorks_mm/jupyter/openacc.cpp
  _Z7openACCifPKfS0_fPfRKi  NVIDIA  devicenum=0
    time(us): 26,150
    60: compute region reached 5 times
        66: kernel launched 5 times
            grid: [8x1024]  block: [128]
             device time(us): total=21,136 max=4,235 min=4,220 avg=4,227
            elapsed time(us): total=21,358 max=4,276 min=4,266 avg=4,271
    60: data region reached 10 times
        60: data copyin transfers: 10
             device time(us): total=3,343 max=347 min=330 avg=334
        74: data copyout transfers: 5
             device time(us): total=1,671 max=340 min=331 avg=334


## BLAS
[blas.cpp](/edit/blas.cpp)

In [None]:
!nvcc --run -ccbin pgc++ -O2 -Xlinker "-lblas" -Xcompiler "-V19.4" blas.cpp -o blas

In [None]:
!./blas 128

## cuBLAS
[cublas.cpp](/edit/cublas.cpp)

In [None]:
!nvcc --run -O2 -lcublas cublas.cpp -o cublas

In [None]:
!./cublas 128
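One detail worth knowing when calling cuBLAS from C or C++ is that the library uses column-major storage, while C arrays are usually laid out row-major. A common workaround relies on the identity $C^T = B^T A^T$: passing the row-major buffers in swapped order to a column-major routine yields the row-major product without any explicit transposes. The sketch below illustrates the idea on the CPU with a hypothetical stand-in, `refGemmColMajor`; it is not a cuBLAS call, and whether `cublas.cpp` uses this trick is not known from the notebook.

```cpp
// Hypothetical CPU stand-in for a column-major GEMM (no transposes, no scaling):
// C (m x n) = A (m x k) * B (k x n), element (r, c) of X stored at X[r + c * ldx].
void refGemmColMajor(int m, int n, int k,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float* C, int ldc) {
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[row + p * lda] * B[p + col * ldb];
            }
            C[row + col * ldc] = sum;
        }
    }
}

// Row-major C = A * B computed through the column-major routine.
// A row-major buffer read as column-major is the transpose, so asking for
// "B * A" in column-major terms computes B^T * A^T = (A * B)^T, which is
// exactly A * B when the result buffer is read back as row-major.
void matmulRowMajorViaColMajor(int n, const float* A, const float* B, float* C) {
    refGemmColMajor(n, n, n, B, n, A, n, C, n);
}
```

The same argument swap is typically applied when calling the real library routine on row-major data.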

## CUDA
[cuda.cu](/edit/cuda.cu)

In [None]:
!nvcc --run -O2 cuda.cu -o cuda

In [None]:
!./cuda 128
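The contents of `cuda.cu` are not shown here. A common way to write such a kernel is to give each GPU thread one element of `C`, deriving its row and column from the block and thread indices. The plain C++ sketch below emulates that index mapping on the CPU so it can be run without a GPU. The `Dim2` struct, the function names, and the block shape are all hypothetical; in real CUDA code `blockIdx`, `blockDim` and `threadIdx` are built-in variables and the nested launch loops are replaced by a kernel launch.

```cpp
// Hypothetical 2D stand-in for CUDA's dim3 / uint3 index types.
struct Dim2 { int x; int y; };

// Work done by a single emulated GPU thread: one element of C.
void matmulThread(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx,
                  int n, const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;  // guard: grid may overshoot the matrix
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
        sum += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = sum;
}

// Emulated launch: visits every (block, thread) pair sequentially.
void launchEmulated(int n, Dim2 blockDim,
                    const float* A, const float* B, float* C) {
    // Round up so the grid covers the whole matrix.
    Dim2 gridDim = { (n + blockDim.x - 1) / blockDim.x,
                     (n + blockDim.y - 1) / blockDim.y };
    for (int by = 0; by < gridDim.y; ++by)
        for (int bx = 0; bx < gridDim.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx)
                    matmulThread({bx, by}, blockDim, {tx, ty}, n, A, B, C);
}
```

On a GPU the four launch loops disappear: all (block, thread) pairs run concurrently, which is where the speedup over the serial version comes from.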