


#  Numba Lab1: Numba For CUDA GPU
---

## Learning Objectives
- **The goal of this lab is to:**
    -   enable you to quickly start using Numba (beginner to advanced level)
    -   teach you to apply the concepts of CUDA GPU programming to HPC field(s); and
    -   show you how to achieve computational speedup on GPUs to maximize the throughput of your HPC implementation.


Before we begin, let's display information about the CUDA driver and the GPUs available on the server by running the `nvidia-smi` command. To do this, execute the cell below by clicking on it with your mouse and pressing Ctrl-Enter, or by pressing the play button in the toolbar above. You should see some output returned below the grey cell.

In [None]:
!nvidia-smi

     
##  Introduction
- Numba is a just-in-time (JIT) compiler for Python that works best on code that uses NumPy arrays, functions, and loops. It provides decorators that are placed above user-defined functions to control how they are compiled.  
- Numba supports the CUDA GPU programming model: a decorated function written in Python is compiled into a CUDA kernel so that it can execute on the GPU. 
- A kernel written in Numba has direct access to NumPy arrays, which Numba transfers between the host (CPU) and the device (GPU) automatically. 


###  Definition of Terms
- The CPU is called a **Host**.  
- The GPU is called a **Device**.
- A GPU function launched by the host and executed on the device is called a **Kernel**.
- A GPU function that is executed on the device and can only be called from the device is called a **Device function**.

### Note
- It is recommended to visit the NVIDIA official documentation web page and read through [CUDA C programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide), because most CUDA programming features exposed by Numba map directly to the CUDA C language offered by NVIDIA. 
- Numba does not implement these CUDA features:
     - dynamic parallelism
     - texture memory

## CUDA Kernel
- In CUDA, the same code is executed by hundreds or thousands of threads in a single launch, so a solution is modeled on the following thread hierarchy: 
    - **Grid**: The collection of blocks that executes a kernel. 
    - **Thread Block**: A collection of threads that can communicate via shared memory. Each thread is executed by a core.
    - **Thread**: A single execution unit that runs the kernel code on the GPU.
- Numba exposes three kinds of GPU memory: 
    - global device memory  
    - shared memory 
    - local memory 
- Memory access should be considered carefully in order to keep bandwidth contention to a minimum.

 <img src="../images/thread_blocks.JPG"/> <img src="../images/memory_architecture.png"/> 

### Kernel Declaration
- A kernel function is a GPU function that is called from CPU code. Launching it requires specifying the number of blocks and the number of threads per block. A kernel cannot return a value; results are written to an array passed as an argument. 
- After it has been compiled once, a kernel can be called multiple times with different numbers of blocks per grid and threads per block.

Example:

```python
import numba.cuda as cuda

@cuda.jit
def arrayAdd(array_A, array_B, array_out):
    # ... kernel body goes here ...
    pass
```
###### Kernel Invocation
- A kernel is typically launched in the following way:
```python
threadsperblock = 128
N = array_out.size
blockspergrid = ( N + (threadsperblock - 1))// threadsperblock
arrayAdd[blockspergrid, threadsperblock](array_A, array_B, array_out)
```
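
The rounding-up step in the launch configuration above can be checked on the host without a GPU. The sketch below is plain Python; the helper name `blocks_per_grid` is made up for illustration and is not part of Numba.

```python
def blocks_per_grid(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block


n = 500000
threads_per_block = 128
blocks = blocks_per_grid(n, threads_per_block)

# Total threads launched and how many of them have no element to process
launched = blocks * threads_per_block
idle = launched - n

print(blocks, launched, idle)
```

Because the grid is rounded up, a few threads in the last block usually fall outside the array, which is why kernels test the index against the array size before writing.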

###### Choosing Block Size
- The block size determines how many threads share a given area of the shared memory.
- The block size must be large enough to keep all of the device's execution units occupied. See more details [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/).

### Thread Positioning 
- When running a kernel, the kernel function’s code is executed by every thread once. Therefore, it is important to uniquely identify distinct threads.
- The default way to determine a thread position in a grid and block is to manually compute the corresponding array positions:


<img src="../images/thread_position.png"/>


```python
threadsperblock = 128
N = array_out.size

@cuda.jit
def arrayAdd(array_A, array_B, array_out):
    tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if tid < N: #Check array boundaries
        array_out[tid] =  array_A[tid] + array_B[tid]

#Unless you are sure the block size and grid size are a divisor of your array size, you must check boundaries as shown in the code block above. 
```
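
To make the index formula concrete, the following host-side sketch enumerates the index that each (block, thread) pair would compute. It is a plain Python simulation, not GPU code; the function name `global_thread_ids` and the small sizes are chosen only for illustration.

```python
def global_thread_ids(blocks_per_grid, threads_per_block):
    """Enumerate the 1D global index each (block, thread) pair would compute."""
    ids = []
    for block_idx in range(blocks_per_grid):          # plays the role of cuda.blockIdx.x
        for thread_idx in range(threads_per_block):   # plays the role of cuda.threadIdx.x
            # threads_per_block plays the role of cuda.blockDim.x
            ids.append(thread_idx + block_idx * threads_per_block)
    return ids


N = 10
threads_per_block = 4
blocks = (N + threads_per_block - 1) // threads_per_block   # 3 blocks, 12 threads

ids = global_thread_ids(blocks, threads_per_block)
in_bounds = [tid for tid in ids if tid < N]   # same test as `if tid < N` in the kernel

print(ids)        # one unique index per thread, 0 through 11
print(in_bounds)  # only indices 0 through 9 touch the array
```

Every thread gets a distinct index, and the two threads whose index is 10 or 11 are filtered out by the boundary check.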
### Example 1: Addition on 1D-Arrays


In [None]:
import numba.cuda as cuda
import numpy as np

N = 500000
threadsperblock = 1000

@cuda.jit()
def arrayAdd(array_A, array_B, array_out):
    tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if tid < N:
        array_out[tid] = array_A[tid] + array_B[tid]


        
array_A = np.arange(N, dtype=np.int32)
array_B = np.arange(N, dtype=np.int32)
array_out = np.zeros(N, dtype=np.int32)

blockpergrid  = (N + (threadsperblock - 1)) // threadsperblock

arrayAdd[blockpergrid, threadsperblock](array_A, array_B, array_out)

print("result: {} ".format(array_out))

**From Example 1:** 
> - N is the size of the array and the number of threads in a single block is 1000.
> - The **cuda.jit()** decorator indicates that the function below it (arrayAdd) is a device kernel that runs in parallel. The **tid** is the unique global index of each thread in the grid: 
>> **tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x**.
> - **array_A** and **array_B** are the input data, while **array_out** is the output array, preloaded with zeros.
> - The statement **blockpergrid  = (N + (threadsperblock - 1)) // threadsperblock** computes the number of blocks per grid. The parentheses around the sum matter: without them, the integer division binds first and far more blocks than needed are launched. This formula is commonly used in GPU programming documentation to compute the number of blocks per grid.
> - **arrayAdd[blockpergrid, threadsperblock](array_A, array_B, array_out)** is a call to the kernel function **arrayAdd**, with the number of blocks per grid and the number of threads per block in square brackets and the kernel arguments in parentheses.




###  Matrix Multiplication on 2D Array 

<img src="../images/2d_array.png"/>

<img src="../images/2d_col_mult.png"/>

> **Note**
> - **Approach 2** is not possible if the matrix size exceeds the maximum number of threads per block on the device, while **Approach 1** continues to work. The latest GPUs have a maximum of 1024 threads per thread block. 
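
The thread-per-block limit mentioned in the note can be checked on the host before launching a 2D kernel. The sketch below is plain Python; `launch_config_2d` is a hypothetical helper, and the constant 1024 is taken from the note above (the actual limit depends on the device).

```python
import math

MAX_THREADS_PER_BLOCK = 1024  # value quoted in the note; device dependent


def launch_config_2d(n_rows, n_cols, threads_per_block):
    """Return blocks per grid for a 2D launch, rejecting oversized blocks."""
    tx, ty = threads_per_block
    if tx * ty > MAX_THREADS_PER_BLOCK:
        raise ValueError(
            "{} x {} = {} threads exceeds the per-block limit".format(tx, ty, tx * ty)
        )
    # Round up in each dimension so the grid covers the whole matrix
    return (math.ceil(n_rows / tx), math.ceil(n_cols / ty))


print(launch_config_2d(4, 4, (2, 2)))        # small matrix, 2 x 2 blocks
print(launch_config_2d(225, 225, (25, 25)))  # 625 threads per block is within the limit

try:
    # One block holding the whole 225 x 225 matrix is too large
    launch_config_2d(225, 225, (225, 225))
except ValueError as err:
    print("rejected:", err)
```

Splitting the matrix across several blocks keeps each block within the limit, whereas a single block covering the whole matrix stops working once the matrix has more elements than a block has threads.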

### Example 2:  Matrix multiplication 

In [None]:
import numba.cuda as cuda
import numpy as np
import math

N = 4
@cuda.jit()
def MatrixMul2D(array_A, array_B, array_out):
   row, col = cuda.grid(2)
   if row < array_out.shape[0] and col < array_out.shape[1]:
      for k in range(N):
         array_out[row][col]+= array_A[row][k] * array_B[k][col]


array_A   = np.array([[0,0,0,0],[1,1,1,1],[2,2,2,2],[3,3,3,3]], dtype=np.int32)
array_B   = np.array([[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]], dtype=np.int32)
array_out = np.zeros(N*N, dtype=np.int32).reshape(N, N)

threadsperblock = (2,2)
blockpergrid_x  = (math.ceil( N / threadsperblock[0]))
blockpergrid_y  = (math.ceil( N / threadsperblock[1]))
blockpergrid    = (blockpergrid_x, blockpergrid_y)

MatrixMul2D[blockpergrid,threadsperblock](array_A, array_B, array_out)

print("array_A:\n {}\n".format(array_A))
print("array_B:\n {}\n".format(array_B))
print("array_A * array_B:\n {}".format(array_out))

# Note
# cuda.grid(2) returns the absolute position of the current thread in the 2D grid (used here as row and col)


### Example 3: A 225 × 225 Matrix Multiplication

In [None]:
N = 225

@cuda.jit()
def MatrixMul2D(array_A, array_B, array_out):
   x, y = cuda.grid(2)
   if x < array_out.shape[0] and y < array_out.shape[1]:
      for k in range(N):
         array_out[x][y] += array_A[x][k] * array_B[k][y]

threadsperblock = (25,25)
# int64 is used because sums of products of these values overflow int32
array_A = np.arange((N*N), dtype=np.int64).reshape(N,N)
array_B = np.arange((N*N), dtype=np.int64).reshape(N,N)
array_out = np.zeros((N*N), dtype=np.int64).reshape(N,N)

blockpergrid_x  = (math.ceil( N / threadsperblock[0]))
blockpergrid_y  = (math.ceil( N / threadsperblock[1]))
blockpergrid    = (blockpergrid_x, blockpergrid_y)

MatrixMul2D[blockpergrid,threadsperblock](array_A, array_B, array_out)

print(array_out)

### Thread Reuse 

- A kernel can be launched with fewer threads than there are data elements, so that each thread is reused to process several elements until the whole array is covered. This is one of the approaches used when the data is larger than the number of threads that can be launched on the device. 
- Each thread advances its index by the total number of threads in the grid inside a while loop: ***tid += cuda.blockDim.x * cuda.gridDim.x***
- The example below illustrates thread reuse. A small number of threads is specified on purpose to show that the approach works. 
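
Before running the loop on the GPU, it may help to trace on the host which array indices each thread visits. The following plain Python sketch simulates the stride loop; the function name `grid_stride_indices` and the tiny sizes are assumptions made for illustration.

```python
def grid_stride_indices(n, block_dim, grid_dim):
    """Return, for each simulated thread, the list of array indices it handles."""
    stride = block_dim * grid_dim      # cuda.blockDim.x * cuda.gridDim.x
    work = {}
    for start in range(stride):        # initial tid of each thread in the grid
        tid = start
        handled = []
        while tid < n:                 # same loop condition as in the kernel
            handled.append(tid)
            tid += stride              # jump ahead by the total number of threads
        work[start] = handled
    return work


# 10 elements processed by a single block of 4 threads
work = grid_stride_indices(n=10, block_dim=4, grid_dim=1)
for thread, indices in work.items():
    print("thread", thread, "handles", indices)

# Every index is visited exactly once across all threads
covered = sorted(i for indices in work.values() for i in indices)
print(covered == list(range(10)))
```

Each thread handles the elements that are one stride apart, so the whole array is covered no matter how few threads are launched.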


#### Example 4: Thread Reuse

In [None]:
import numba.cuda as cuda
import numpy as np

N = 500000
threadsperblock = 1000

@cuda.jit
def arrayAdd(array_A, array_B, array_out):
   tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
   while tid < N:
      array_out[tid] = array_A[tid] + array_B[tid]
      tid += cuda.blockDim.x * cuda.gridDim.x

array_A = np.arange(N, dtype=np.int32)
array_B = np.arange(N, dtype=np.int32)
array_out = np.zeros(N, dtype=np.int32)

arrayAdd[1, threadsperblock](array_A, array_B, array_out)

print("result: {} ".format(array_out))



> **Note**
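> - The task in **Example 4** is the same as in **Example 1**, but it is launched with far fewer threads (a single block of 1000 threads). The same result is still obtained because each thread processes several elements. 
> - A launch configuration may also provide more threads than there are elements, for instance up to one block of excess threads when the number of blocks is rounded up. The **while tid < N** condition makes such threads exit without doing any work.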


## Memory Management

### Data Transfer 
- When a kernel is called with NumPy arrays, Numba automatically transfers them to the device and copies them back to the host when the kernel finishes.
- To avoid unnecessary transfers, for example of read-only arrays, the following APIs can be used to control the transfers manually.

##### 1.  Copy host to device
```python
import numba.cuda as cuda
import numpy as np

N = 500000
h_A = np.arange(N, dtype=np.int32)  # np.int was removed from recent NumPy releases
h_B = np.arange(N, dtype=np.int32)
h_C = np.zeros(N, dtype=np.int32)

d_A = cuda.to_device(h_A)
d_B = cuda.to_device(h_B)
d_C = cuda.to_device(h_C)
```
##### 2.  Enqueue the transfer to a stream
```python
h_A    = np.arange(N, dtype=np.int32)
stream = cuda.stream()
d_A    = cuda.to_device(h_A, stream=stream)
```
##### 3.  Copy device to host / enqueue the transfer to a stream 
```python
h_C = d_C.copy_to_host()
h_C = d_C.copy_to_host(stream=stream)
```
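
A rough way to see what manual transfers save is to count the bytes that cross the bus in each case. The sketch below uses only NumPy and assumes Numba's conservative default, in which every host array passed to a kernel is copied to the device and copied back afterwards; the variable names `auto_bytes` and `manual_bytes` are illustrative.

```python
import numpy as np

N = 500000
h_A = np.arange(N, dtype=np.int32)   # input, read-only in the kernel
h_B = np.arange(N, dtype=np.int32)   # input, read-only in the kernel
h_C = np.zeros(N, dtype=np.int32)    # output
arrays = [h_A, h_B, h_C]

# Automatic transfer (assumed): each array goes to the device and comes back
auto_bytes = 2 * sum(a.nbytes for a in arrays)

# Manual transfer: three copies to the device, only the result is copied back
manual_bytes = sum(a.nbytes for a in arrays) + h_C.nbytes

print("automatic:", auto_bytes, "bytes")
print("manual:   ", manual_bytes, "bytes")
print("saved:    ", auto_bytes - manual_bytes, "bytes")
```

The saving grows with the number of read-only inputs and with the number of kernel launches that reuse the same device arrays.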
### Example 5:  Data Movement 

In [None]:
import numba.cuda as cuda
import numpy as np
N = 200
threadsperblock = 25

@cuda.jit
def arrayAdd(d_A, d_B, d_C):
   tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
   if tid < N:
      d_C[tid] = d_A[tid] + d_B[tid]
      
h_A = np.arange(N, dtype=np.int32)
h_B = np.arange(N, dtype=np.int32)
h_C = np.zeros(N, dtype=np.int32)

d_A = cuda.to_device(h_A)
d_B = cuda.to_device(h_B)
d_C = cuda.to_device(h_C)

blockpergrid  = (N + (threadsperblock - 1)) // threadsperblock
arrayAdd[blockpergrid, threadsperblock](d_A, d_B, d_C)

h_C = d_C.copy_to_host()
print(h_C)


## Atomic Operation

- An atomic operation is required when multiple threads attempt to modify the same memory location. 
- Typical examples include simultaneous withdrawals from one bank account through ATMs, or a large number of threads modifying a particular index of an array based on certain condition(s).
- The atomic operations currently supported by Numba include:
> **import numba.cuda as cuda**
> - cuda.atomic.add(array, index, value)
> - cuda.atomic.min(array, index, value)
> - cuda.atomic.max(array, index, value)
> - cuda.atomic.nanmax(array, index, value)
> - cuda.atomic.nanmin(array, index, value)
> - cuda.atomic.compare_and_swap(array, old_value, current_value)
> - cuda.atomic.sub(array, index, value)
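
To see why an unsynchronized update can lose data, the following host-side sketch replays one possible interleaving of a read-modify-write by ten threads. It is plain Python, not GPU code, and the interleaving shown is only an assumption: on a real GPU the order is not specified and the wrong result can differ from run to run.

```python
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Non-atomic: every simulated thread reads the total before any thread writes it back
total = 0
reads = [total for _ in values]                    # all threads read the old value 0
writes = [r + v for r, v in zip(reads, values)]    # each thread adds its own element
for w in writes:
    total = w                                      # later writes overwrite earlier ones
non_atomic_total = total

# Atomic: each read-modify-write completes before the next one starts
total = 0
for v in values:
    total += v
atomic_total = total

print("non-atomic:", non_atomic_total)
print("atomic:    ", atomic_total)
```

In this interleaving only the last write survives, so nine of the ten contributions are lost, while the atomic version keeps all of them.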

In [None]:
# Task ==> sum of an array: [1,2,3,4,5,6,7,8,9,10] in parallel
# Note that the order in which threads execute is not guaranteed

# atomic operation example 
size = 10
nthread = 10
@cuda.jit()
def add_atomic(my_array, total):
   tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
   cuda.atomic.add(total,0, my_array[tid])

my_array = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.int32)
total = np.zeros(1, dtype=np.int32)
nblock = int(size / nthread)
add_atomic[nblock, nthread](my_array, total)
print("Atomic:", total)

######################################################################################
# Non-atomic operation example  
size = 10
nthread = 10
@cuda.jit()
def add_non_atomic(my_array, total):
   tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
   # Unsynchronized read-modify-write: concurrent updates can overwrite each other
   total[0] += my_array[tid]
   

my_array = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.int32)
total = np.zeros(1, dtype=np.int32)
nblock = int(size / nthread)
add_non_atomic[nblock, nthread](my_array, total)
print("Non atomic: ", total)



### CUDA Ufuncs

- A CUDA ufunc accepts arrays that are already on the device, which reduces traffic over the PCI-express bus. 
- It also supports an asynchronous mode through the `stream` keyword.
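
A ufunc applies a scalar function to every element of its input arrays. As a reference for checking a GPU result, the same elementwise expression can be evaluated on the CPU with NumPy alone. The function name `compute_reference` and the small inputs below are chosen for illustration.

```python
import numpy as np


def compute_reference(a, b):
    """CPU reference: NumPy applies the expression to each pair of elements."""
    return (a - b) * (a + b)


A = np.arange(6, dtype=np.float32)   # 0, 1, 2, 3, 4, 5
B = np.ones(6, dtype=np.float32)     # 1, 1, 1, 1, 1, 1
C = compute_reference(A, B)

print(C)          # elementwise (a - b) * (a + b), i.e. a*a - b*b
print(C.dtype)    # float32 inputs give a float32 result
```

Comparing a GPU ufunc's output against such a reference (for example with `np.allclose`) is a simple way to validate it.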

<img src="../images/ufunc.png"/>

In [None]:
# example: c = (a - b) * (a + b)
# size of each array(A, B, C) is N = 10000

from numba import vectorize
import numba.cuda as cuda
import numpy as np

@vectorize(['float32(float32, float32)'],target='cuda')
def compute(a, b):
    return (a - b) * (a + b)

N = 10000
A = np.arange(N , dtype=np.float32)
B = np.arange(N, dtype=np.float32)
C = compute(A, B)

print(C.reshape(100,100))

### Device Function

- CUDA device functions can only be invoked from within the device (by a kernel or another device function) and, unlike kernels, can return a value like normal functions. The device function is usually defined before the CUDA ufunc kernel that calls it; otherwise the call may not be visible inside the ufunc kernel.
- The attribute <i>device=True</i> indicates that <i>"device_ufunc"</i> is a device function, and <i>inline=True</i> requests that it be inlined into its caller.

In [None]:
#example: c = sqrt((a - b) * (a + b))

from numba import vectorize
import numba.cuda as cuda
import numpy as np
import math

@cuda.jit('float32(float32)', device=True, inline=True)
def device_ufunc(c):
   return math.sqrt(c)

@vectorize(['float32(float32, float32)'],target='cuda')
def compute(a, b):
    c = (a - b) * (a + b)
    return device_ufunc(c)


## Summary

<img src="../images/numba_summary1.png"/>


---

## Lab Task

In this section, you are expected to click on the **Serial Code Lab Assignment** link and proceed to Lab 2. In that lab you will find three serial Python functions. You are required to revise the **pair_gpu** function so that it runs on the GPU, and to make a few modifications within the **main** function.

## <div style="text-align:center; color:#FF0000; border:3px solid red;height:80px;"> <b><br/> [Serial Code Lab Assignment](serial_RDF.ipynb) </b> </div>

---

## Post-Lab Summary

If you would like to download this lab for later viewing, we recommend you go to your browser's File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied as well. You can also execute the following cell block to create a zip-file of the files you've been working on and download it with the link below.



In [None]:
%%bash
cd ..
rm -f nways_files.zip
zip -r nways_files.zip *


**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip).

**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.

---

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../../nways_MD_start_python.ipynb>HOME</a></p>

---


# Links and Resources

[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)

[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight System profiler output, please download the latest version of Nsight System from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

---


## References

- Numba Documentation, Release 0.52.0-py3.7-linux-x86_64.egg, Anaconda, Nov 30, 2020.
- Bhaumik Vaidya, Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA, Packt Publishing, 2018.
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/


--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0  International (CC BY 4.0).