# Section 4: Introduction to High-Performance Computing on GPU

## 4.1. What's a GPU?


![](https://github.com/bsotomayorg/Intro_HPC_Python/blob/main/notebooks/imgs/slides_d2/065.PNG?raw=1)



+ GPU stands for **G**raphics **P**rocessing **U**nit.
+ It is a specialized processor dedicated to graphics processing tasks.

## 4.2. Differences between CPU and GPU

**CPU:** 
+ Multiple cores.
+ Complex control logic.
+ Optimized for serial operations.

**GPU:** 
+ Many parallel execution units (ALUs).
+ Best-known use case: graphics.

## 4.3. Compute Unified Device Architecture (CUDA)

### 4.3.1. What's CUDA?



<img src="https://github.com/bsotomayorg/Intro_HPC_Python/blob/main/notebooks/imgs/slides_d2/070.PNG?raw=1" width="70%" height="70%" />

+ CUDA is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements in order to execute functions called *kernels*.
+ CUDA enables General-Purpose computing on GPUs (GPGPU).
+ GPUs traditionally handle computations for computer graphics.

### 4.3.2. CUDA's Program Flow

<img src="https://github.com/bsotomayorg/Intro_HPC_Python/blob/main/notebooks/imgs/slides_d2/071.PNG?raw=1" width="75%" height="75%" />

1. Load data on Host.
2. Allocate device memory.
3. Copy data from Host to Device.
4. Execute device *kernels* to process data.
5. Copy results from Device to Host memory.

## 4.4. Parallel GPU computing with Python

### 4.4.1. Universal functions (`ufunc`)

A universal function is a function that operates on `ndarrays` in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. 

A `ufunc` is a "vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs. 

(_Source: [NumPy Documentation](https://numpy.org/doc/stable/reference/ufuncs.html)_).
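
The properties listed in this definition can be observed with NumPy's built-in ufuncs. The short sketch below is only an illustration; the array values are arbitrary and not taken from the course material.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# Element-by-element: np.sqrt is applied to every entry.
roots = np.sqrt(values)
print(roots)

# Broadcasting: shapes (3,) and (2, 1) combine into shape (2, 3).
row = np.array([10, 20, 30])
col = np.array([[1], [2]])
table = np.add(row, col)
print(table.shape)

# Type casting: integer input plus a float yields a float64 result.
shifted = np.add(row, 0.5)
print(shifted.dtype)

# A ufunc has a fixed number of inputs and outputs.
print(np.add.nin, np.add.nout)
```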

#### 4.4.1.1. Example: My first vectorized function

In [None]:
from numba import vectorize
import numpy as np
import math

In [None]:
# number of data points
num_points = int(1e6)  # one million points

In [None]:
@vectorize
def cpu_sqrt(x):
    return math.sqrt(x)

In [None]:
@vectorize(['float32(float32)'], target='cuda')
def gpu_sqrt(x):
    return math.sqrt(x)

#### 4.4.1.2. Allowing multiple signatures in vectorized functions

Numba's vectorized functions can accept more than one input data type. To support this, pass one signature per type combination in the list given to the `vectorize` decorator.

For example:

In [None]:
@vectorize(['int32(int32, int32)', 'float64(float64, float64)'])
def my_ufunc(x, y):
    return x+y+math.sqrt(x*math.cos(y))

In [None]:
@vectorize(['int32(int32, int32)', 'float64(float64, float64)'])
def my_ufunc(x, y):
    return np.abs(x-y)

In [None]:
a = np.arange(1.0, 10.0, dtype='f8')
b = np.arange(1.1, 10.1, dtype='f8')
print(my_ufunc(a, b))

[0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]


In [None]:
a = np.arange(1, 10, dtype='i4')
b = np.arange(2, 11, dtype='i4')
print(my_ufunc(a, b))

[1 1 1 1 1 1 1 1 1]


#### 4.4.1.4. Exercise

Create two vectorized functions: `my_cpu_ufunc` and `my_gpu_ufunc`. Both receive two arrays `x` and `y` as input parameters.

Make `my_cpu_ufunc` and `my_gpu_ufunc` run on the CPU and the GPU, respectively.

#### 4.4.1.5. Starter code

In [None]:
from numba import vectorize 

In [None]:
# generating data
a = np.arange(1.0, 10.0)
b = np.ones(shape=a.shape[0])

In [None]:
# add the decorator here!
def my_cpu_ufunc(x, y):
    return abs(x-y)

In [None]:
# add the decorator here!
def my_gpu_ufunc(x, y):
    return abs(x-y)

Try them!

In [None]:
%time my_cpu_ufunc(a, b)

CPU times: user 25 µs, sys: 4 µs, total: 29 µs
Wall time: 33.4 µs


array([0., 1., 2., 3., 4., 5., 6., 7., 8.])

In [None]:
%time my_gpu_ufunc(a, b)

CPU times: user 4.34 ms, sys: 33 µs, total: 4.37 ms
Wall time: 4.51 ms


array([0., 1., 2., 3., 4., 5., 6., 7., 8.])

How does the performance of the CPU and GPU vectorized functions compare?

#### 4.4.1.6. Solution

In [None]:
import numpy as np

In [None]:
from numba import vectorize 
import math

In [None]:
@vectorize(['float64(float64, float64)'], target='cpu')
def my_cpu_ufunc(x, y):
    return abs(x-y)

In [None]:
@vectorize(['float64(float64, float64)'], target='cuda')
def my_gpu_ufunc(x, y):
    return abs(x-y)

In [None]:
a = np.arange(1.0, 10.0)
b = np.ones(shape=a.shape[0])

In [None]:
# Calls the compiled version of my_cpu_ufunc for each element of a and b
print(my_cpu_ufunc(a, b))

[0. 1. 2. 3. 4. 5. 6. 7. 8.]


In [None]:
# Calls the compiled version of my_gpu_ufunc for each element of a and b
print(my_gpu_ufunc(a, b))

[0. 1. 2. 3. 4. 5. 6. 7. 8.]


----

### 4.4.2. GPU device functions

#### 4.4.2.1. Introduction

Remember the CUDA program flow? With the GPU's device functions we can control how our data is transferred to and from the GPU.

<img src="https://github.com/bsotomayorg/Intro_HPC_Python/blob/main/notebooks/imgs/slides_d2/071.PNG?raw=1" width="70%" height="70%" />

These are compiled functions that execute on the GPU.

#### 4.4.2.2. Example

In [None]:
from numba import cuda, vectorize
import numpy as np

In [None]:
@vectorize(['int16(int16, int16)'], target='cuda')
def a_device_function(x, y):
    return x + y

In [None]:
n = 10_000
x = np.ones(shape=n, dtype=np.int16)
y = x*2
print(x)
print(y)

[1 1 1 ... 1 1 1]
[2 2 2 ... 2 2 2]


In [None]:
# transfer the inputs to the GPU
x_gpu = cuda.to_device(x)
y_gpu = cuda.to_device(y)

In [None]:
# allocate the output array on the GPU
z_gpu = cuda.device_array(shape=(n,), dtype=np.int16)

In [None]:
a_device_function(x_gpu, y_gpu, out=z_gpu)

<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f94f73de450>

In [None]:
z = z_gpu.copy_to_host()

In [None]:
print(z)

[3 3 3 ... 3 3 3]


----

# Troubleshooting

#### Installing Numba + CUDA on Google Colab!

(_Source: [thedatafrog.com](https://thedatafrog.com/en/articles/boost-python-gpu/)_).

We need to point Numba to two libraries: `libdevice` and `libnvvm.so`.

To find them, run:


In [None]:
!find / -iname 'libdevice'

/usr/local/lib/python3.7/dist-packages/jaxlib/cuda/nvvm/libdevice
/usr/local/cuda-11.0/nvvm/libdevice
/usr/local/cuda-11.1/nvvm/libdevice
/usr/local/cuda-10.0/nvvm/libdevice
/usr/local/cuda-10.1/nvvm/libdevice
find: ‘/proc/34/task/34/net’: Invalid argument
find: ‘/proc/34/net’: Invalid argument


In [None]:
!find / -iname 'libnvvm.so'

/usr/local/cuda-11.0/nvvm/lib64/libnvvm.so
/usr/local/cuda-11.1/nvvm/lib64/libnvvm.so
/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so
/usr/local/cuda-10.1/nvvm/lib64/libnvvm.so
find: ‘/proc/34/task/34/net’: Invalid argument
find: ‘/proc/34/net’: Invalid argument


Finally, execute the cell below:

In [None]:
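import os
# Paths taken from the `find` output above; adjust them to your own environment.
# Assumption: the NUMBAPRO_* variables apply to older Numba releases; newer ones locate the toolkit via CUDA_HOME.
os.environ['NUMBAPRO_LIBDEVICE'] = "/usr/local/lib/python3.7/dist-packages/jaxlib/cuda/nvvm/libdevice"
os.environ['NUMBAPRO_NVVM'] = "/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so"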

----

Day #2 of the summer course "_Introduction to High-Performance Computing in Python for Scientists!_". 


[Goethe Research Academy for Early Career Researchers (GRADE)](https://www.goethe-university-frankfurt.de/), Goethe University Frankfurt, Germany. June 2022.

---