# CuPy Intro
* CuPy is an open-source library for GPU-accelerated computing in Python.
* Builds on optimized CUDA libraries such as cuBLAS, cuFFT, and cuSOLVER.
* Highly compatible with NumPy, SciPy, and other popular packages.
* Easy to install; supports NVIDIA GPUs and, experimentally, AMD GPUs.
* Provides several ways to write custom GPU kernels.

# Exploring the available device and its attributes

In [None]:
import cupy as cp
device = cp.cuda.Device()
device.use()

# Query the properties once, using the integer device ID
properties = cp.cuda.runtime.getDeviceProperties(device.id)
attributes = device.attributes

# The device name is returned as bytes, so decode it for printing
print('Using device: ', properties['name'].decode())
print('Number of SMs: ', attributes['MultiProcessorCount'])
print('Maximum threads per block: ', properties['maxThreadsPerBlock'])
print('Shared memory per block (KB): ', properties['sharedMemPerBlock'] / 1024)
print('Global memory size (GB): ', properties['totalGlobalMem'] / (1024**3))


Notice that:
* A GPU typically has many more cores than a CPU (at least an order of magnitude more).
* The fast shared memory (tens of KB per block) is roughly five to six orders of magnitude smaller than the global memory (tens of GB).
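
To make the size gap concrete, the sketch below computes the ratio for a hypothetical device with 48 KB of shared memory per block and 16 GB of global memory. Both figures are illustrative assumptions, not values read from a particular GPU; substitute the numbers printed by the cell above.

```python
import math

# Hypothetical example values (assumptions, not measured on a real device)
shared_mem_bytes = 48 * 1024        # 48 KB of shared memory per block
global_mem_bytes = 16 * 1024**3     # 16 GB of global memory

ratio = global_mem_bytes / shared_mem_bytes
orders_of_magnitude = math.log10(ratio)

print(f'Global / shared ratio: {ratio:,.0f}')
print(f'Orders of magnitude:   {orders_of_magnitude:.1f}')
```

With these example values the gap falls between five and six orders of magnitude.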

# CuPy Arrays
The interface is almost identical to that of NumPy arrays.


In [None]:
import numpy as np

# Supports the usual array creation routines, such as zeros, ones, empty, etc.
dev_a = cp.arange(10, dtype=int)
dev_b = cp.array([1, 2, 3, 4])
print(type(dev_a))

# Arrays can be printed directly, though this copies the data from device to host
%time print(dev_a)

a = np.arange(10, dtype=int)
%time print(a)

In [None]:
# CuPy also supports NumPy-style slicing and fancy indexing

# Strided slice with start and stop indices
print(dev_a[1:-1:2])
# Gather using a list of indices
print(dev_a[[0, 2, 4]])
# Select using a boolean mask
print(dev_a[dev_a % 3 == 0])

In [None]:
# Easy to transfer arrays between the device and the host

a = np.arange(0, 20, 2)
dev_a = cp.asarray(a)

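# The two arrays contain the same elements
print(np.allclose(a, dev_a))
print(cp.allclose(a, dev_a))
# Note: some NumPy and CuPy functions, such as allclose here, may accept a mix of
# NumPy and CuPy arrays, depending on the installed versions. Most CuPy functions
# expect CuPy arrays, so explicit conversion is the safer default.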

In [None]:
# Getting an array back to the host is simple:
b = cp.asnumpy(dev_a)
c = dev_a.get()
print(type(b), type(c))

# Here cp.allclose is called with two NumPy arrays; whether a CuPy function
# accepts NumPy inputs depends on the function and the CuPy version
print(cp.allclose(b, c))
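
Because implicit mixing of host and device arrays is not something to rely on, a small conversion helper can make transfers explicit. The sketch below is one possible shape for such a helper; the name `to_host` is hypothetical, and the `try`/`except` import is only there so the example also runs on a machine without CuPy.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # allow the sketch to run on machines without CuPy
    cp = None


def to_host(array):
    """Return a NumPy array, copying from the device only if needed."""
    if cp is not None and isinstance(array, cp.ndarray):
        return cp.asnumpy(array)
    return np.asarray(array)


host = to_host(np.arange(0, 20, 2))
print(type(host), host)
```

Calling `to_host(dev_a)` on a CuPy array would perform the same device-to-host copy as `cp.asnumpy(dev_a)`.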


# Supported functions

The complete list is here: https://docs.cupy.dev/en/stable/reference/comparison.html

It includes NumPy and SciPy routines.

CuPy behaves like a drop-in replacement for NumPy:

In [None]:
import numpy as np
import cupy as cp

for xp in [np, cp]:
    x = xp.arange(10)
    W = xp.ones((10, 5))
    y = xp.dot(x, W)
    print(y)
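
The same idea extends to functions that should work on whichever array type they receive. A minimal sketch, assuming `cp.get_array_module` is used to pick the library that owns the input; the function `softplus` is a hypothetical example, and the guarded import only lets the code run where CuPy is not installed.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # fall back to NumPy-only behaviour without CuPy
    cp = None


def softplus(x):
    """Compute log(1 + exp(x)) with whichever library owns `x`."""
    xp = cp.get_array_module(x) if cp is not None else np
    return xp.log1p(xp.exp(x))


print(softplus(np.linspace(-1.0, 1.0, 5)))
```

Passing a CuPy array would run the same code on the GPU and return a CuPy array.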

# Ways to GPU-accelerate a function: 1) Using NumPy/SciPy equivalent operations
The easiest way is to use NumPy-like array operations and the supported functions.

In [None]:
# Timing GPU kernels is trickier, because they run asynchronously with respect to the host
def benchmark(func, args, n_repeat=10, n_warmup=1):
    import time
    import cupy as cp
    start_gpu = cp.cuda.Event()
    end_gpu = cp.cuda.Event()
    # Warm-up runs absorb one-off costs such as kernel compilation
    for i in range(n_warmup):
        out = func(*args)

    start_gpu.record()
    start_cpu = time.perf_counter()
    for i in range(n_repeat):
        out = func(*args)

    # The CPU timer measures host-side time only; launches may return before the kernels finish
    end_cpu = time.perf_counter()
    end_gpu.record()
    # Wait for the GPU to reach the end event before reading the elapsed time
    end_gpu.synchronize()
    t_gpu = cp.cuda.get_elapsed_time(start_gpu, end_gpu)  # milliseconds
    t_cpu = 1000 * (end_cpu - start_cpu)
    print('Average GPU time (ms): ', t_gpu / n_repeat)
    print('Average CPU time (ms): ', t_cpu / n_repeat)


In [None]:
# saxpy_trig: an arbitrary compute-intensive function
def saxpy_trig(x, y, a):
    return cp.exp(a * cp.sin(x) + cp.cos(y))

dev_x = cp.random.uniform(size=10000000, dtype=np.float32)
dev_y = cp.random.uniform(size=10000000, dtype=np.float32)
a = 0.5

In [None]:
benchmark(saxpy_trig, (dev_x, dev_y, a))

# Ways to GPU-accelerate a function: 2) User-defined elementwise or reduction kernels

## Elementwise kernels
Use an elementwise kernel when the operation acts on corresponding elements of the input arrays, e.g. `arr_a + arr_b`:
```python
for i in range(size):
    y[i] = F(a[i], b[i], ..., c1, c2, ...)
```

In [None]:
saxpy_trig_elemwise = cp.ElementwiseKernel(
    'float32 x, float32 y, float32 a',  # Input types
    'float32 z',                        # Output type
    'z = exp(a * sin(x) + cos(y))',     # operation
    'saxpy_trig_elemwise'               # Kernel name
)


In [None]:
benchmark(saxpy_trig_elemwise, (dev_x, dev_y, a))

## Reduction kernels
Use a reduction kernel when you want to reduce an array to a single element, e.g. `arr_a.sum()` or `arr_a.max()`:
```python
y = c
for i in range(size):
    y = F(y, a[i], b[i], ..., c1, c2, ...)
```

In [None]:
# The scalar is named alpha because 'a' and 'b' have a special meaning
# in the reduce and post-map expressions
saxpy_trig_sum_reduction = cp.ReductionKernel(
    'float32 x, float32 y, float32 alpha',  # input arguments
    'float32 z',                            # output arguments
    'exp(alpha * sin(x) + cos(y))',         # map expression
    'a + b',                                # reduce expression
    'z = a',                                # post-map expression
    '0',                                    # identity element
    'saxpy_trig_sum_reduction'              # name
)
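
The roles of the map, reduce, post-map, and identity arguments can be easier to follow in a plain CPU version. The sketch below is a NumPy reference of the same computation, not how CuPy executes it; the function name `saxpy_trig_sum_reference` and the scalar name `alpha` are chosen here for illustration.

```python
from functools import reduce

import numpy as np


def saxpy_trig_sum_reference(x, y, alpha):
    """CPU sketch of the map / reduce / post-map stages of the reduction."""
    # Map expression: applied independently to every pair of elements
    mapped = np.exp(alpha * np.sin(x) + np.cos(y))
    # Reduce expression 'a + b', starting from the identity element 0
    total = reduce(lambda a, b: a + b, mapped, np.float32(0))
    # Post-map expression 'z = a': the reduced value becomes the output
    return total


rng = np.random.default_rng(0)
x = rng.random(1000, dtype=np.float32)
y = rng.random(1000, dtype=np.float32)
print(saxpy_trig_sum_reference(x, y, 0.5))
```

A GPU reduction combines elements in a different order than this sequential loop, so small floating-point differences between the two results are expected.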

In [None]:
benchmark(saxpy_trig_sum_reduction, (dev_x, dev_y, a))

# Ways to GPU-accelerate a function: 3) Kernel fusion
`cp.fuse` combines several elementwise operations into a single kernel, which avoids intermediate arrays and extra kernel launches.

In [None]:
@cp.fuse(kernel_name='saxpy_trig_fused')
def saxpy_trig_fused(x, y, a):
    return cp.exp(a * cp.sin(x) + cp.cos(y))
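
To see what fusion saves, it helps to write the same expression one array operation at a time. The sketch below does this with NumPy so it runs anywhere; the function name `saxpy_trig_unfused` and the `xp` parameter are hypothetical, and passing `xp=cp` with CuPy arrays is assumed to behave the same way on the GPU.

```python
import numpy as np


def saxpy_trig_unfused(x, y, a, xp=np):
    """Same result as the fused version, written one array operation at a time."""
    t1 = xp.sin(x)      # temporary 1
    t2 = a * t1         # temporary 2
    t3 = xp.cos(y)      # temporary 3
    t4 = t2 + t3        # temporary 4
    return xp.exp(t4)   # final result


x = np.linspace(0.0, 1.0, 5, dtype=np.float32)
y = np.linspace(1.0, 0.0, 5, dtype=np.float32)
print(saxpy_trig_unfused(x, y, 0.5))
```

Without fusion, each of these steps would typically be a separate kernel that reads and writes a full-size array; a fused kernel evaluates the whole expression per element in one pass.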


In [None]:
benchmark(saxpy_trig_fused, (dev_x, dev_y, a))

# Interoperability
CuPy can be combined with many other libraries, including NumPy, mpi4py, PyTorch and ... Numba!

In [None]:
from numba import vectorize, cuda
import math

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def saxpy_trig_numba_vec(x, y, a):
    return math.exp(a * math.sin(x) + math.cos(y))

In [None]:
benchmark(saxpy_trig_numba_vec, (dev_x, dev_y, a))

In [None]:
@cuda.jit
def saxpy_trig_numba_jit(x, y, a, out):
    tid = cuda.grid(1)
    if tid < x.shape[0]:
        out[tid] = math.exp(a * math.sin(x[tid]) + math.cos(y[tid]))

dev_out = cuda.device_array(len(dev_x), dtype=np.float32)

block_size = 1024
# Ceiling division, so that every element is covered by a thread
grid_size = (len(dev_x) + block_size - 1) // block_size
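
The grid size arithmetic can be checked by hand. The sketch below works through the numbers for the 10,000,000-element arrays used in this notebook; the variable names `n_elements` and `total_threads` are introduced here for illustration.

```python
n_elements = 10_000_000   # length of dev_x in this notebook
block_size = 1024         # threads per block

# Ceiling division: round up so that every element gets a thread
grid_size = (n_elements + block_size - 1) // block_size
total_threads = grid_size * block_size

print(grid_size)                     # 9766
print(total_threads - n_elements)    # 384 surplus threads
```

The surplus threads in the last block are the reason the kernel needs the `tid < x.shape[0]` bounds check.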


In [None]:
benchmark(saxpy_trig_numba_jit[grid_size, block_size], (dev_x, dev_y, a, dev_out))

Notice that in the previous function calls we passed CuPy arrays to Numba CUDA-jitted functions, and even a mix of CuPy and Numba device arrays. This works because both libraries implement the `__cuda_array_interface__` protocol.

In [None]:
print(type(dev_out))
# Zero-copy conversion: the CuPy array shares the Numba array's device memory
dev_cp_out = cp.asarray(dev_out)
print(type(dev_cp_out))

# Other interesting features of CuPy include
* A JIT decorator (`cupyx.jit.rawkernel`) that compiles Python-like kernels to CUDA kernels.
* Raw CUDA source files can be imported and compiled easily.
* Acceleration libraries: some of the most basic operations are accelerated using HPC CUDA backends such as CUB and cuTENSOR.
* AMD support: CuPy has experimental support for AMD GPUs (ROCm), with a growing set of supported features.