<a href="https://colab.research.google.com/github/llacuesta/nvidia-python/blob/main/01_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# note: verify CUDA before running
!nvcc --version

In [None]:
!pip install numba==0.58.0

In [None]:
# note: verify numba
import numba
print(numba.__version__)

In [None]:
from numba import cuda
print(cuda.gpus)

# **Intro To CUDA Python with Numba**


Numba
- accelerates numerically focused Python code on CPUs or GPUs
- function compiler: compiles individual functions rather than whole applications
- type-specializing: generates a separate implementation for each combination of argument types instead of generic code
- just-in-time: translates a function when it is first called, which suits interactive use such as Jupyter notebooks
- numerically focused: works best with numeric types and NumPy arrays

Compiling for CPU
- Numba is enabled using the `@jit` decorator
- Numba also keeps the original Python implementation in the `.py_func` attribute

In [None]:
from numba import jit
import math

# function decorator for jit
@jit
def hypot(x, y):
    x = abs(x)
    y = abs(y)
    t = min(x, y)
    x = max(x, y)
    t = t / x
    return x * math.sqrt(1 + t * t)

In [None]:
hypot(3.0, 4.0)
hypot.py_func(3.0, 4.0) # original uncompiled Python function, same result

# Benchmarking


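- can be done with Python's `timeit` (here through the `%timeit` magic)
- the Numba-compiled function is typically faster than the pure Python implementation
- for a single scalar call like this one, the C-implemented `math.hypot` is typically faster still, because every call into a Numba function carries dispatch overhead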

In [None]:
%timeit hypot.py_func(3.0, 4.0)

In [None]:
%timeit hypot(3.0, 4.0)

In [None]:
%timeit math.hypot(3.0, 4.0)
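
Outside a notebook the `%timeit` magic is not available. A minimal sketch of the same comparison with the standard-library `timeit` module follows. Here `hypot_py` is a plain-Python copy of the function above and `time_call` is a hypothetical helper; the Numba-compiled version is left out so the snippet runs without Numba.

```python
import math
import timeit


def hypot_py(x, y):
    """Plain-Python copy of the hypot function used above."""
    x = abs(x)
    y = abs(y)
    t = min(x, y)
    x = max(x, y)
    t = t / x
    return x * math.sqrt(1 + t * t)


def time_call(func, *args, number=100_000):
    """Return the average number of seconds per call over `number` calls."""
    total = timeit.timeit(lambda: func(*args), number=number)
    return total / number


py_time = time_call(hypot_py, 3.0, 4.0)
c_time = time_call(math.hypot, 3.0, 4.0)
print(f"pure Python: {py_time * 1e9:.0f} ns per call")
print(f"math.hypot:  {c_time * 1e9:.0f} ns per call")
```

The absolute numbers depend on the machine, so only the relative ordering is meaningful.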

# How Numba Works

- Numba wraps the Python function, analyzes its bytecode, and infers the types of the arguments when the function is first called
- types matter because GPU performance can differ greatly between data types (for example, 32-bit versus 64-bit floats)
- when Numba cannot translate a function (for example, one that takes a Python dictionary), it may fall back to `object mode`, which runs without type specialization; whether this fallback happens by default depends on the Numba version
- to surface compilation errors instead, force `nopython mode` by passing `nopython=True` to the `@jit` decorator
- `nopython mode` is recommended to maximize `jit` performance

In [None]:
hypot.inspect_types()

In [None]:
# object mode
# note: forceobj=True requests object mode explicitly; what a bare @jit
# does with this function depends on the Numba version
@jit(forceobj=True)
def cannot_compile(x):
    return x['key']

cannot_compile(dict(key='value'))

In [None]:
# nopython mode: compilation fails with a TypingError for a Python dict
@jit(nopython=True)
def cannot_compile(x):
    return x['key']

cannot_compile(dict(key='value'))

# Making `ufuncs` for the GPU

- ufuncs (universal functions) are NumPy functions that accept arguments of varying dimensions and operate on them element by element
- with the `@vectorize` decorator, Numba compiles a scalar Python function into a fast ufunc
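
As a quick illustration of what "varying dimensions" means, here is a minimal sketch using NumPy's built-in `np.add` ufunc. The array names are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
c = np.arange(16).reshape(4, 4)

same_shape = np.add(a, b)     # element by element on equal shapes
with_scalar = np.add(a, 100)  # the scalar is broadcast to every element
with_matrix = np.add(b, c)    # the 1-D array is broadcast across each row

print(same_shape)
print(with_scalar)
print(with_matrix.shape)
```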


In [None]:
from numba import vectorize

@vectorize
def add_ten(num):
  return num + 10

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
c = np.arange(4*4).reshape((4,4))

nums = np.arange(10)
add_ten(a)


- to run a ufunc on the GPU with CUDA, pass `target='cuda'` to `@vectorize` and provide explicit type signatures for the arguments. See [Numba available types](https://numba.pydata.org/numba-doc/dev/reference/types.html)
- this will:
  - compile a CUDA kernel that executes the ufunc in parallel over the input elements
  - allocate GPU memory for the inputs and outputs and copy the inputs to the GPU
  - execute the CUDA kernel and copy the result back from the GPU to the CPU

In [None]:
@vectorize(['int64(int64, int64)'], target='cuda') # Type signature and target are required for the GPU
def add_ufunc(x, y):
    return x + y

Some considerations before using the GPU:
- the GPU needs large inputs and more complex operations to amortize the overhead of launching parallel work
- copying data to and from the GPU can be expensive; keep the data on the GPU until processing is complete unless a transfer is necessary
- data types matter: use the smallest type that is precise enough (for example `float32` rather than `float64`, `int32` rather than `int64`)
- not all Python is allowed on the GPU. Only the following are allowed:
  - `if`/`elif`/`else`
  - `while` and `for` loops
  - Basic math operators
  - Selected functions from the `math` and `cmath` modules
  - Tuples

In [None]:
%timeit np.add(b, c)   # NumPy on CPU

In [None]:
%timeit add_ufunc(b, c) # Numba on GPU

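Note: for functions that do not perform element-wise operations, use `cuda.jit`. With `device=True`, it defines a device function: a helper that can only be called from other GPU code.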

In [None]:
from numba import cuda

@cuda.jit(device=True)
def polar_to_cartesian(rho, theta):
    x = rho * math.cos(theta)
    y = rho * math.sin(theta)
    return x, y

@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')
def polar_distance(rho1, theta1, rho2, theta2):
    x1, y1 = polar_to_cartesian(rho1, theta1) # We can use device functions inside our GPU ufuncs
    x2, y2 = polar_to_cartesian(rho2, theta2)

    return ((x1 - x2)**2 + (y1 - y2)**2)**0.5
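
To sanity-check a GPU ufunc like this one, it can help to have a CPU reference to compare against. The sketch below is a plain NumPy version; `polar_distance_reference` is a hypothetical helper, not part of Numba, and the inputs are made-up test values.

```python
import numpy as np


def polar_distance_reference(rho1, theta1, rho2, theta2):
    """NumPy reference: distance between two points given in polar form."""
    x1, y1 = rho1 * np.cos(theta1), rho1 * np.sin(theta1)
    x2, y2 = rho2 * np.cos(theta2), rho2 * np.sin(theta2)
    return np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


rho1 = np.array([1.0, 2.0], dtype=np.float32)
theta1 = np.array([0.0, 0.0], dtype=np.float32)
rho2 = np.array([1.0, 2.0], dtype=np.float32)
theta2 = np.array([np.pi, np.pi / 2], dtype=np.float32)

expected = polar_distance_reference(rho1, theta1, rho2, theta2)
print(expected)
```

On a machine with a GPU, the output of `polar_distance(rho1, theta1, rho2, theta2)` could then be compared with `np.allclose`, using a tolerance suited to `float32`.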

# Using Drivers

- using `cuda` from Numba, transfers can be reduced by copying inputs to the GPU once instead of copying them to and from the GPU on every call
- this is done with CUDA device arrays, which hold the data on the GPU, instead of passing host arrays directly


In [None]:
@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y

n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x

# use device arrays
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

print(x_device)
print(x_device.shape)
print(x_device.dtype)

In [None]:
%timeit add_ufunc(x, y)

In [None]:
%timeit add_ufunc(x_device, y_device) # no copy overhead

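To control when the output is copied from the GPU to the host, pass a device array as the `out` argument and call `copy_to_host()` on it once the result is needed.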

In [None]:
out_device = cuda.device_array(shape=(n,), dtype=np.float32) # allocate an uninitialized device array (similar to np.empty)

add_ufunc(x_device, y_device, out=out_device) # set out output buffer
out_host = out_device.copy_to_host() # copy to host
print(out_host[:10])