Before we begin, let's run the `nvidia-smi` command to display information about the CUDA driver and the GPUs on the server. To do this, give the cell below focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!nvidia-smi

## Learning objectives
The **goal** of this lab is to:
- Learn how to use CUDA C to parallelize our code.
- Understand the basic terms and steps involved in making a sequential code parallel.

We do not intend to cover:
- Optimization techniques such as memory access patterns and the memory hierarchy.

# Introduction
Graphics Processing Units (GPUs) were initially designed to accelerate graphics processing, but in 2007 the release of CUDA introduced GPUs as general-purpose processors. CUDA is a parallel computing platform and programming model that makes using a GPU for general-purpose computing simple and elegant. The developer still programs in the familiar C, C++, Fortran, or one of an ever-expanding list of supported languages, and incorporates extensions of these languages in the form of a few basic keywords.

CUDA C/C++ is:
- Based on standard C/C++
- A small set of extensions to enable heterogeneous programming
- A straightforward API to manage devices, memory, etc.


# CUDA 


**Heterogeneous Computing:** CUDA is a heterogeneous programming model that includes provisions for both a CPU and a GPU. The CUDA C/C++ programming interface consists of C language extensions that let you target portions of the source code for parallel execution on the device (GPU). It is based on standard C/C++ and provides a library of C functions that can be executed on the host (CPU) so that it can interact with the device. The two processors that work with each other are:

- Host: CPU and its memory (Host Memory)
- Device: GPU and its memory  (Device Memory)


Let us look at a Hello World! program in CUDA C:

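```cpp
#include <stdio.h>

// Kernel: runs on the device (GPU)
__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] From device\n", threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1,1>>>();   // launch 1 block containing 1 thread
    cudaDeviceSynchronize();     // wait for the kernel to finish
    return 0;
}
```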

As you might have already observed, CUDA C is a small set of extensions and constructs added to an existing language. Let us look at the additional constructs we introduced above:

- ```__global__``` : This keyword, when added before a function, tells the compiler that the function runs on the device and can be launched from the host. 
- ``` <<<,>>> ``` : This syntax tells the compiler that this is a call to a device function (a kernel launch) and not to a host function. The 1,1 parameters set the number of blocks and the number of threads per block to launch. We will cover the parameters inside the angle brackets later. 
- ``` threadIdx.x, blockIdx.x ``` : These are the index of the thread within its block and the index of the block within the grid. Together they give every thread a unique ID. 
- ``` cudaDeviceSynchronize() ``` : All kernel (a function that runs on the GPU) launches in CUDA are asynchronous. This API call makes sure that the host does not proceed until all previously issued device work has completed.
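To see these constructs working together with more than one thread, here is a minimal sketch that launches 2 blocks of 4 threads and checks for launch errors. The names `print_ids` and `launch_print_ids` are made up for this illustration; `cudaGetLastError()` and `cudaGetErrorString()` are standard CUDA runtime calls.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: every launched thread prints its own thread and block index.
__global__ void print_ids(void) {
    printf("thread %u of block %u\n", threadIdx.x, blockIdx.x);
}

// Hypothetical helper: launches `blocks` blocks of `threads` threads each
// and waits for them. Returns cudaSuccess if launch and execution succeeded.
cudaError_t launch_print_ids(int blocks, int threads) {
    print_ids<<<blocks, threads>>>();
    cudaError_t err = cudaGetLastError();   // reports invalid launch configurations
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // waits for the kernel and flushes device printf
}

int main(void) {
    cudaError_t err = launch_print_ids(2, 4);   // 2 blocks x 4 threads = 8 lines
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

The order in which the blocks print is not guaranteed, so the lines may appear in a different order from run to run.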


## GPU Architecture
 
In this section we describe the CUDA programming model by showing how the software programming concepts map to the GPU hardware.

The diagram below shows a high-level abstraction of the components of the GPU hardware and the corresponding programming model mapping. 

<img src="../images/cuda_hw_sw.png">

As shown in the diagram above, the CUDA programming model is tightly coupled with the hardware design. This makes CUDA one of the most efficient parallel programming models for shared memory systems. Another way to look at the diagram is given below: 

| Software | Executes  | Hardware |
| --- | --- | --- |
| CUDA thread  | on/as | CUDA Core | 
| CUDA block  | on/as | Streaming Multiprocessor |
| GRID/Kernel  | on/as | GPU Device |

We will get into the concepts of _blocks_ and _threads_ in an upcoming section. But let us first look at the steps involved in writing CUDA code.


## Steps in CUDA Programming

The table below highlights the typical steps required to convert sequential code to CUDA code:

| Sequential code | CUDA Code |
| --- | --- |
| **Step 1** Allocate memory on the CPU (_malloc_, _new_) | **Step 1** Allocate memory on the CPU (_malloc_, _new_) |
| **Step 2** Populate/initialize the CPU data | **Step 2** Allocate memory on the GPU, using an API like _cudaMalloc()_ |
| **Step 3** Call the CPU function that crunches the data | **Step 3** Populate/initialize the CPU data |
| **Step 4** Consume the crunched data on the host | **Step 4** Transfer the data from the host to the device with _cudaMemcpy()_ |
| | **Step 5** Call the GPU function with _<<<,>>>_ brackets |
| | **Step 6** Synchronize the device and host with _cudaDeviceSynchronize()_ |
| | **Step 7** Transfer data from the device to the host with _cudaMemcpy()_ |
| | **Step 8** Consume the crunched data on the host |

CPU and GPU memory are separate, and the developer needs to use additional CUDA APIs to allocate and free memory on the GPU. Only device memory can be consumed inside a GPU function call (kernel). Linear memory on the device is typically allocated using ```cudaMalloc()``` and freed using ```cudaFree()```, and data transfers between host memory and device memory are typically done using ```cudaMemcpy()```.


The API definitions are as follows: 

**cudaError_t cudaMalloc (void ∗∗ devPtr, size_t size)** Allocates size bytes of linear memory on the device and returns a pointer to the allocated memory in devPtr. The allocated memory is suitably aligned for any kind of variable. cudaMalloc() returns ```cudaSuccess``` on success and ```cudaErrorMemoryAllocation``` if the allocation fails.

**cudaError_t cudaMemcpy (void ∗ dst, const void ∗ src, size_t count, enum cudaMemcpyKind kind)** Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of ```cudaMemcpyHostToHost```, ```cudaMemcpyHostToDevice```, ```cudaMemcpyDeviceToHost```, or ```cudaMemcpyDeviceToDevice```, and specifies the direction of the copy. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in undefined behavior.

**cudaError_t cudaFree (void ∗ devPtr)** Frees the memory space pointed to by devPtr, which must have been returned by a previous call to cudaMalloc() or another equivalent API. 
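Since each of these calls returns a `cudaError_t`, it can help to check every return value. The sketch below copies a buffer to the device and back through the three APIs. The `CUDA_CHECK` macro and the `round_trip` function are hypothetical helpers written for this illustration; they are not part of CUDA.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print the error and exit on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Copies n ints to the device and back; returns true if the data is unchanged.
bool round_trip(int n) {
    size_t bytes = (size_t)n * sizeof(int);
    int *h_in = (int *)malloc(bytes);
    int *h_out = (int *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_in[i] = i; h_out[i] = -1; }

    int *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, bytes));
    CUDA_CHECK(cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_buf));

    bool ok = true;
    for (int i = 0; i < n; i++) {
        if (h_out[i] != h_in[i]) ok = false;
    }
    free(h_in);
    free(h_out);
    return ok;
}

int main(void) {
    printf("round trip %s\n", round_trip(1024) ? "ok" : "failed");
    return 0;
}
```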

Let us look at these steps in more detail for a simple vector addition code:

<img src="../images/cuda_vec_add2.png">
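The eight steps from the table can be sketched as one small program. This is an illustration under assumptions, not the lab's reference code: the kernel is launched with `<<<1,1>>>` and loops over all elements in a single thread, and the function name `vector_add` and the test values are invented for this example.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel executed by a single thread that loops over all n elements.
__global__ void device_add(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Runs the eight steps for n elements (hypothetical helper).
// Returns the number of wrong results (0 on success) or -1 on a CUDA error.
int vector_add(int n) {
    size_t bytes = (size_t)n * sizeof(int);

    // Step 1: allocate memory on the CPU
    int *a = (int *)malloc(bytes);
    int *b = (int *)malloc(bytes);
    int *c = (int *)malloc(bytes);

    // Step 2: allocate memory on the GPU
    int *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaError_t err = cudaMalloc((void **)&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_c, bytes);

    // Step 3: populate the CPU data
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    // Step 4: transfer the inputs from the host to the device
    if (err == cudaSuccess) err = cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        // Step 5: call the GPU function with the <<<,>>> brackets
        device_add<<<1, 1>>>(d_a, d_b, d_c, n);
        // Step 6: synchronize the device and the host
        err = cudaDeviceSynchronize();
    }

    // Step 7: transfer the result from the device to the host
    if (err == cudaSuccess) err = cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Step 8: consume the crunched data on the host
    int wrong = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; i++) {
            if (c[i] != 3 * i) wrong++;
        }
    }

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);   // cudaFree(NULL) is a no-op
    free(a); free(b); free(c);
    return err == cudaSuccess ? wrong : -1;
}

int main(void) {
    printf("wrong elements: %d\n", vector_add(512));
    return 0;
}
```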


### Unified Memory
An easier way to allocate memory accessible by the GPU is to use *Unified Memory*. It provides a single memory space accessible by all GPUs and CPUs in the system. To allocate data in unified memory, call `cudaMallocManaged()`, which returns a pointer that you can access from host (CPU) code or device (GPU) code. To free the data, just pass the pointer to `cudaFree()`. To read more about unified memory, please check out the blog post [Unified Memory for CUDA beginners](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).

<img src="../images/unified_memory.png">

Below is an example of how to use managed memory in CUDA code:

```cpp
 // Allocate Unified Memory -- accessible from CPU or GPU
  int *a, *b, *c;
  cudaMallocManaged(&a, N*sizeof(int));
  cudaMallocManaged(&b, N*sizeof(int));
  cudaMallocManaged(&c, N*sizeof(int));
  ...

  // Free memory
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
```
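A complete version of the snippet above might look like the following sketch. It assumes a single-thread kernel that loops over the array; the function name `vector_add_managed` and the test values are invented for this illustration. Note that the host must call `cudaDeviceSynchronize()` before reading results the kernel wrote to managed memory.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Kernel executed by a single thread that loops over all n elements.
__global__ void device_add(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: vector addition with Unified Memory.
// Returns the number of wrong results (0 on success) or -1 on a CUDA error.
int vector_add_managed(int n) {
    size_t bytes = (size_t)n * sizeof(int);

    // Allocate Unified Memory, accessible from CPU or GPU
    int *a = NULL, *b = NULL, *c = NULL;
    cudaError_t err = cudaMallocManaged(&a, bytes);
    if (err == cudaSuccess) err = cudaMallocManaged(&b, bytes);
    if (err == cudaSuccess) err = cudaMallocManaged(&c, bytes);

    int wrong = 0;
    if (err == cudaSuccess) {
        // The host writes directly through the managed pointers
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // No cudaMemcpy is needed: the kernel uses the same pointers
        device_add<<<1, 1>>>(a, b, c, n);

        // Wait for the kernel before the host touches c
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        for (int i = 0; i < n; i++) {
            if (c[i] != 3 * i) wrong++;
        }
    }

    // Free memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return err == cudaSuccess ? wrong : -1;
}

int main(void) {
    printf("wrong elements: %d\n", vector_add_managed(512));
    return 0;
}
```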

## Understanding Threads and Blocks
In this section we will look at _thread_ and _block_ level parallelism. The number of blocks and threads to be launched is passed as parameters inside the ```<<<,>>>``` brackets in a kernel call.

### Creating multiple blocks

In order to create multiple blocks for the vector addition code above, you need to change two things:
1. Change _<<<1,1>>>_ to _<<<N,1>>>_, which launches N blocks.
2. Access the array with the block index using the built-in variable available in every CUDA kernel: _blockIdx.x_

```cpp
// Change the launch from device_add<<<1,1>>> to
device_add<<<N,1>>>(d_a, d_b, d_c);

// Access the array using the built-in blockIdx.x variable
__global__ void device_add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```

By using blockIdx.x to index the array, each block handles a different element of the array, and the blocks may execute in parallel.

| Block Id | Performs |
| --- | --- |
| Block 0 | _c\[0\]=b\[0\]+a\[0\]_ |
| Block 1 | _c\[1\]=b\[1\]+a\[1\]_ |
| Block 2 | _c\[2\]=b\[2\]+a\[2\]_ |

**Understand and analyze** the sample vector addition code [vector_addition_block.cu](../../source_code/cudac/vector_addition_gpu_block_only.cu). Open the downloaded files for inspection. 



### Creating multiple threads

In order to create multiple threads for the vector addition code above, you need to change two things:
1. Change _<<<1,1>>>_ to _<<<1,N>>>_, which launches N threads inside 1 block.
2. Access the array with the thread index using the built-in variable available in every CUDA kernel: _threadIdx.x_

```cpp
// Change the launch from device_add<<<1,1>>> to
device_add<<<1,N>>>(d_a, d_b, d_c);

// Access the array using the built-in threadIdx.x variable
__global__ void device_add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
```

By using threadIdx.x to index the array, each thread handles a different element of the array and can execute in parallel.

| Thread Id | Performs |
| --- | --- |
| Thread 0 | _c\[0\]=b\[0\]+a\[0\]_ |
| Thread 1 | _c\[1\]=b\[1\]+a\[1\]_ |
| Thread 2 | _c\[2\]=b\[2\]+a\[2\]_ |

**Understand and analyze** the sample vector addition code [vector_addition_thread.cu](../../source_code/cudac/vector_addition_gpu_thread_only.cu). Open the downloaded files for inspection. 


### Creating multiple blocks each having many threads

So far, we've looked at parallel vector addition through the use of several blocks with one thread and one block with several threads. Now let us look at creating multiple blocks, each block containing multiple threads.

To understand this, let's take a scenario where 32 vector elements need to be added in parallel, so 32 parallel execution units are required. As a first step, let us define that each block contains eight threads (we are not saying this is the optimal configuration; it is just for explanation purposes). Next we define the number of blocks. The simplest calculation is No_Of_Blocks = 32/8 = 4, where 8 is the number of threads per block. The code changes required to launch 4 blocks with 8 threads each are shown below: 
1. Change _<<<1,1>>>_ to _<<<4,8>>>_, which launches 4 blocks with 8 threads per block.
2. Access the array with both the thread index and the block index using the built-in variables available in every CUDA kernel: _threadIdx.x_, _blockIdx.x_, and _blockDim.x_, which tells how many threads are allocated per block. 

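```cpp
int threads_per_block = 8;
int no_of_blocks = N / threads_per_block;   // assumes N is a multiple of threads_per_block
device_add<<<no_of_blocks, threads_per_block>>>(d_a, d_b, d_c);

__global__ void device_add(int *a, int *b, int *c) {
    // Global index: offset of this block plus the thread's position inside it
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
```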

The diagram below shows the launch configuration that we discussed so far:

<img src="../images/cuda_indexing.png">

Modern GPU architectures consist of multiple SMs, each containing a number of cores. In order to utilize the whole GPU, it is important to make use of both threads and blocks. 
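When the number of elements is not a multiple of the block size, a common approach is to round the number of blocks up and guard the array access inside the kernel. The sketch below illustrates this under assumptions: it uses managed memory to stay short, and the function name `vector_add_grid` and the sizes (1000 elements, 256 threads per block) are arbitrary choices for illustration, not tuned values.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; surplus threads in the last block do nothing.
__global__ void device_add(const int *a, const int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

// Hypothetical helper: returns the number of wrong results, or -1 on a CUDA error.
int vector_add_grid(int n, int threads_per_block) {
    size_t bytes = (size_t)n * sizeof(int);
    int *a = NULL, *b = NULL, *c = NULL;
    cudaError_t err = cudaMallocManaged(&a, bytes);
    if (err == cudaSuccess) err = cudaMallocManaged(&b, bytes);
    if (err == cudaSuccess) err = cudaMallocManaged(&c, bytes);

    int wrong = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // Round up so that no_of_blocks * threads_per_block >= n
        int no_of_blocks = (n + threads_per_block - 1) / threads_per_block;
        device_add<<<no_of_blocks, threads_per_block>>>(a, b, c, n);

        err = cudaGetLastError();                         // launch configuration errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        for (int i = 0; i < n; i++) {
            if (c[i] != 3 * i) wrong++;
        }
    }

    cudaFree(a); cudaFree(b); cudaFree(c);
    return err == cudaSuccess ? wrong : -1;
}

int main(void) {
    // 1000 elements with 256 threads per block gives 4 blocks (1024 threads, 24 idle)
    printf("wrong elements: %d\n", vector_add_grid(1000, 256));
    return 0;
}
```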

**Understand and analyze** the sample vector addition code [vector_addition_block_thread.cu](../../source_code/cudac/vector_addition_gpu_thread_block.cu). Open the downloaded files for inspection. 


A more important question that may arise is: why bother with threads at all? What do we gain by adding an additional level of parallelism? The short answer is that, in the CUDA programming model, threads within a block have mechanisms to communicate and synchronize efficiently, whereas parallel blocks do not.

This is necessary to implement certain algorithms where threads need to communicate with each other. We do not require synchronization across threads in **Pair Calculation**, so we will not go into the details of synchronization across threads or the use of specialized memory such as _shared_ memory in this tutorial.  

# Atomic Construct

In the code you will also require one more construct to get the right results. CUDA atomic functions such as `atomicAdd()` ensure that a particular variable is read and updated atomically, which prevents race conditions and indeterminate results. In other words, they prevent one thread from stepping on the toes of other threads that access the same variable simultaneously, which would otherwise produce different results from run to run. For example, to count the number of elements that have a value greater than zero, we could write the following:

```cpp

__global__ void countMoreThanZero( ... )
{
    if ( val > 0 )
    {
        atomicAdd(&cnt[0],1);
    }
}
```
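A complete, runnable variant of the counting idea is sketched below. The kernel signature, the function name `count_positive`, and the input pattern are assumptions made for this illustration; the lab's actual kernel arguments are not shown in the snippet above.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread inspects one element and atomically increments the shared counter.
__global__ void count_more_than_zero(const int *vals, int n, int *cnt) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && vals[i] > 0) {
        atomicAdd(&cnt[0], 1);   // safe even when many threads update cnt[0] at once
    }
}

// Hypothetical helper: fills vals with 1 at every third index and -1 elsewhere,
// then counts the positive entries on the GPU. Returns -1 on a CUDA error.
int count_positive(int n) {
    int *vals = NULL, *cnt = NULL;
    cudaError_t err = cudaMallocManaged(&vals, (size_t)n * sizeof(int));
    if (err == cudaSuccess) err = cudaMallocManaged(&cnt, sizeof(int));

    int result = -1;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; i++) vals[i] = (i % 3 == 0) ? 1 : -1;
        cnt[0] = 0;

        int threads_per_block = 128;
        int no_of_blocks = (n + threads_per_block - 1) / threads_per_block;
        count_more_than_zero<<<no_of_blocks, threads_per_block>>>(vals, n, cnt);

        err = cudaDeviceSynchronize();
        if (err == cudaSuccess) result = cnt[0];
    }

    cudaFree(vals);
    cudaFree(cnt);
    return result;
}

int main(void) {
    printf("elements greater than zero: %d\n", count_positive(1000));
    return 0;
}
```

Replacing the `atomicAdd()` call with a plain `cnt[0] = cnt[0] + 1` would create a race between threads, so the count could come out too low and vary between runs.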

# A Quick Recap
We saw the definition of CUDA and CUDA C. We briefly covered the CUDA architecture and introduced CUDA C constructs. We also experimented with block and thread configurations for a simple vector addition code. All this was done under the following restrictions:
1. **Multiple Dimension**: We launched threads and blocks in one dimension. We have been using _threadIdx.x_ and _blockIdx.x_, so what is _.x_? It indicates that we are launching threads and blocks in one dimension only. CUDA allows threads and blocks to be launched in up to 3 dimensions, so you can also use _.y_ and _.z_ for index calculation. For example, you can launch threads and blocks in 2 dimensions to divide the work for a 2D image. Also, the maximum number of threads per block and the number of blocks allowed per dimension are restricted based on the GPU that the code is run on.
2. **GPU Memory**: What we have not covered is that the GPU has a memory hierarchy. For example, the GPU has a read-only memory called _texture_ memory, which provides high bandwidth for accesses with 2D and 3D locality. The GPU also provides a limited scratchpad memory called _shared memory_.
3. **Optimization**: What we have not covered so far is the right way to use the compute and memory resources to get maximum performance. 
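As a small illustration of the first restriction, the sketch below launches a 2D grid of 2D blocks over a width x height array using `dim3`. The kernel name `fill_2d`, the helper `run_fill_2d`, the 16x16 block shape, and the value written to each element are all arbitrary choices for this example.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its (x, y) position and writes one element of a row-major array.
__global__ void fill_2d(int *out, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;   // column
    int y = threadIdx.y + blockIdx.y * blockDim.y;   // row
    if (x < width && y < height) {
        out[y * width + x] = x + y;
    }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int run_fill_2d(int width, int height) {
    int *out = NULL;
    size_t bytes = (size_t)width * height * sizeof(int);
    if (cudaMallocManaged(&out, bytes) != cudaSuccess) return -1;

    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,           // round up in x
              (height + block.y - 1) / block.y);         // round up in y
    fill_2d<<<grid, block>>>(out, width, height);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int wrong = 0;
    if (err == cudaSuccess) {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (out[y * width + x] != x + y) wrong++;
            }
        }
    }
    cudaFree(out);
    return err == cudaSuccess ? wrong : -1;
}

int main(void) {
    printf("wrong elements: %d\n", run_fill_2d(100, 60));
    return 0;
}
```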

**One key characteristic of CUDA is that the user can control the data access pattern of each thread and decide which part of memory the data sits in. While this lab covers the part of this that is required to port our code, we do not intend to cover all optimizations.**

## Compile and Run for NVIDIA GPU
Now, let's start modifying the original code and add CUDA C constructs. You can either explicitly transfer the allocated data between the CPU and GPU, or use unified memory, which creates a pool of managed memory that is shared between the CPU and GPU.

Click on the <b>[rdf.cu](../../source_code/cudac/rdf.cu)</b> and <b>[dcdread.h](../../source_code/cudac/dcdread.h)</b> links and modify `rdf.cu` and `dcdread.h`. Remember to **SAVE** your code after changes, before running below cells.

In [None]:
#compile for Tesla GPU
!cd ../../source_code/cudac && nvcc -o rdf rdf.cu

Make sure to validate the output by running the executable.

In [None]:
#Run on NVIDIA GPU
!cd ../../source_code/cudac && ./rdf && cat Pair_entropy.dat

The output should be the following:

```
s2 value is -2.43191
s2bond value is -3.87014
```

In [None]:
#profile and see output of nvtx
!cd ../../source_code/cudac && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_cuda ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/cudac/rdf_cuda.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/cuda_profile_timeline.png">

Nsight Systems is capable of capturing information about CUDA execution in the profiled process. The CUDA API row in the timeline view shows traces of the CUDA Runtime and Driver calls made by the application. If you hover your mouse over it, you will see more information about the calls.

<img src="../images/cuda_profile_api.png">


Near the bottom of the timeline row tree, the GPU node will appear and contain a CUDA node. Within the CUDA node, each CUDA context used within the process will be shown along with its corresponding CUDA streams. Streams will contain memory operations and kernel launches on the GPU. In the example screenshot below, kernel launches are shown in blue, while memory transfers are shown in red and green. In this example, unified memory was used rather than explicitly transferring data between the CPU and GPU.

<img src="../images/cuda_profile.png">


Feel free to check out the [solution (with managed memory)](../../source_code/cudac/SOLUTION/rdf_unified_memory.cu) or the [solution (without managed memory)](../../source_code/cudac/SOLUTION/rdf_malloc.cu) to help you understand better or to compare your implementation with the sample solution.


# CUDA C Analysis

**Usage Scenarios**

Using language extensions like CUDA C and CUDA Fortran helps developers get the best performance out of their code on an NVIDIA GPU. CUDA C and other language constructs expose the GPU architecture and programming model, which gives developers more control over memory storage, memory access, and thread control. Depending on the type of application, this can provide a many-fold improvement over, say, compiler-generated code produced with the help of directives. 

**How is CUDA different from other GPU programming models like OpenACC and OpenMP?**

CUDA C should not be considered an alternative to OpenMP or OpenACC. In fact, CUDA complements directive-based programming models, and there are defined interoperability strategies between them. You can always start accelerating your code with OpenACC and use CUDA C to optimize the most performance-critical kernels. For example, use OpenACC for data transfer and then pass a device pointer to one of the critical kernels, which is written in CUDA C. 

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f nways_files.zip
zip -r nways_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip). Let us now go back to parallelizing our code using other approaches.

**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.

-----

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../../nways_MD_start.ipynb>HOME</a></p>

-----


# Links and Resources
[Introduction to CUDA](https://devblogs.nvidia.com/even-easier-introduction-cuda/)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).