## CuPy

CuPy is a Python package that implements NumPy arrays and methods on GPUs. It positions itself as a GPU-accelerated drop-in replacement for NumPy and SciPy, but it provides considerably more than that, including some low-level CUDA support. Note that CuPy primarily targets NVIDIA CUDA-capable devices (hence its name) but also provides experimental support for AMD ROCm devices.

So, let's have a look at the basics of using CuPy.

### GPU-accelerated NumPy

CuPy can be used as a drop-in GPU-accelerated replacement for NumPy. Using CuPy for this purpose is as easy as going through your Python code and replacing `numpy` with `cupy` (and/or `np` with `cp`), for example, by changing this:

In [None]:
import numpy

h_data = numpy.array([1, 2, 3])
h_L2 = numpy.linalg.norm(h_data)

to this:

In [None]:
import cupy

d_data = cupy.array([1, 2, 3])
d_L2 = cupy.linalg.norm(d_data)

In general, CuPy tries to preserve NumPy behavior. However, there are some differences, which are documented here: https://docs.cupy.dev/en/stable/user_guide/difference.html

### Data transfer

When we merely replace `numpy` with `cupy`, both data and calculations move to the GPU. After all, this is the whole point of using CuPy. To move data between CPU and GPU explicitly, CuPy provides several methods:

#### Moving data to GPU

* `cupy.asarray` method moves any object that can be passed to `numpy.array` to the currently active GPU. This method accepts CuPy arrays too. It is similar to the `cupy.array` method we used to create CuPy arrays above, except that it avoids making a copy when the input is already a CuPy array on the current device.
* `cupy.ndarray.set` method sets values of an existing CuPy array:

In [None]:
import cupy
import numpy

d_array = cupy.asarray([1, 2, 3])
n = 3
d_a = cupy.empty((n, n), dtype=float)
h_b = numpy.arange(d_a.size, dtype=float).reshape(d_a.shape)

d_a.set(h_b)

#### Moving data to CPU

* `cupy.asnumpy` method returns a NumPy array created from the provided input. This method accepts CuPy arrays as well as any other object that can be converted to a NumPy array.
* `cupy.ndarray.get` method returns a NumPy array that corresponds to the CuPy array:

In [None]:
d_data = cupy.array([1, 2, 3])
h_data = cupy.asnumpy(d_data)

# Alternative:
h_data_too = d_data.get()

### Memory management

In general, CuPy takes care of memory issues in the background. What we need to know about memory management in CuPy is that, to mitigate the overheads associated with memory allocation and CPU/GPU synchronization, CuPy uses two _memory pools_:

* Device memory pool. Used for GPU memory allocations.
* Pinned memory pool. Used during CPU-to-GPU data transfers.

### User-defined kernels

Similar to PyCUDA, CuPy allows a programmer to process data by means of three types of kernels:

1. Elementwise kernels
2. Reduction kernels
3. Custom kernels, called _raw_ in CuPy nomenclature.

Luckily, CuPy's kernel syntax is somewhat similar to that of PyCUDA.

### Elementwise kernels

Here is an example of a kernel that computes elementwise squared difference for two arrays `x` and `y`:

```python
squared_diff = cupy.ElementwiseKernel(
   'float32 x, float32 y',
   'float32 z',
   'z = (x - y) * (x - y)',
   'squared_diff')
```
The first argument is a string representation of comma-separated input arguments.
The second argument is a string representation of the (internal) output variable.
The third argument is a string representation of the body of the kernel.
The last argument is the name of the kernel.

Once the kernel is compiled, it can be used in Python code as a normal Python function:

```python
x = cupy.arange(10, dtype=numpy.float32).reshape(2, 5)
y = cupy.arange(5, dtype=numpy.float32)
squared_diff(x, y)
```

### Reduction kernels

Here is an example of a custom reduction kernel that computes the L2 norm along a specified axis:

```python
l2norm_kernel = cupy.ReductionKernel(
    'T x',  # input params
    'T y',  # output params
    'x * x',  # map
    'a + b',  # reduce. 'a' and 'b' are reserved variables
    'y = sqrt(a)',  # post-reduction map. 'a' is a reserved variable. 'y' is the output param above
    '0',  # identity value (the initial value of the reduction)
    'l2norm'  # kernel name
)
```
which can be used like a normal Python function applied to a CuPy array:
```python
d_data = cupy.arange(10, dtype=numpy.float32).reshape(2, 5)
l2norm_kernel(d_data, axis=1)
```
Detailed discussion of reduction kernels is beyond the scope of this brief overview. If you're interested in reduction kernels, navigate to the corresponding page of the User's Guide: https://docs.cupy.dev/en/stable/user_guide/kernel.html#reduction-kernels.

### Raw kernels

CuPy provides a mechanism to create individual kernels in CUDA C. This approach enables fine-grained control over kernel execution parameters. Here is an example of a raw kernel:

In [None]:
add_kernel = cupy.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float* y) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    y[idx] = x1[idx] + x2[idx];
}
''', 'my_add')

x1 = cupy.arange(25, dtype=cupy.float32).reshape(5, 5)
x2 = cupy.arange(25, dtype=cupy.float32).reshape(5, 5)
y = cupy.zeros((5, 5), dtype=cupy.float32)

add_kernel((5,), (5,), (x1, x2, y))  # 5x1x1 grid,  5x1x1 blocks, and arguments

### Raw modules

A raw module (`cupy.RawModule`) holds CUDA source code that defines several kernels, which are then retrieved by name with `get_function`:

In [None]:
loaded_from_source = r'''
extern "C"{
__global__ void test_sum(const float* x1, const float* x2, float* y, unsigned int N) {
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N)
        y[idx] = x1[idx] + x2[idx];
}

__global__ void test_multiply(const float* x1, const float* x2, float* y, unsigned int N){
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N)
        y[idx] = x1[idx] * x2[idx];
}
}'''
module = cupy.RawModule(code=loaded_from_source)
ker_sum = module.get_function('test_sum')
ker_times = module.get_function('test_multiply')

# generate some data
N = 10
x1 = cupy.arange(N**2, dtype=cupy.float32).reshape(N, N)
x2 = cupy.ones((N, N), dtype=cupy.float32)
y = cupy.zeros((N, N), dtype=cupy.float32)

# apply 'test_sum' kernel
ker_sum((N,), (N,), (x1, x2, y, N**2))   # y = x1 + x2
assert cupy.allclose(y, x1 + x2)

# apply 'test_multiply' kernel
ker_times((N,), (N,), (x1, x2, y, N**2)) # y = x1 * x2
assert cupy.allclose(y, x1 * x2)

### Better kernels

Simple elementwise and reduction kernels can also be defined more easily using the `cupy.fuse()` decorator. For example, the `squared_diff` kernel that we defined in the "Elementwise kernels" section can be created with:

In [None]:
@cupy.fuse()
def squared_diff(x, y):
    return (x - y) * (x - y)

And here is an example of a simple reduction kernel:

In [None]:
@cupy.fuse()
def sum_of_products(x, y):
    return cupy.sum(x * y, axis=-1)

These kernels can be called on CuPy arrays, NumPy arrays, and even scalars.

### JIT Raw kernels

Finally, CuPy provides a way to use the decorator approach to create Raw kernels! For that, we need the `jit` module from the `cupyx` package:

In [None]:
import cupy
from cupyx import jit

@jit.rawkernel()
def elementwise_copy(x, y, size):
    idx = jit.threadIdx.x + jit.blockIdx.x * jit.blockDim.x
    ntid = jit.gridDim.x * jit.blockDim.x
    for i in range(idx, size, ntid):
        y[i] = x[i]

# How to use
size = cupy.uint32(2 ** 22)
x = cupy.random.normal(size=(size,), dtype=cupy.float32)
y = cupy.empty((size,), dtype=cupy.float32)

elementwise_copy((128,), (1024,), (x, y, size))  # RawKernel style
assert (x == y).all()

elementwise_copy[128, 1024](x, y, size)  #  Numba style
assert (x == y).all()