# CuPy

Now that we've explored some low-level GPU APIs with Numba, let's shift gears and work with some high-level array functionality in [CuPy](https://cupy.dev/).

CuPy began as part of the Chainer project and has maintainers from many organisations, including NVIDIA. CuPy implements the familiar NumPy API with a backend written in CUDA C++. This allows folks who are already familiar with NumPy to get GPU acceleration quickly, often by just switching out an import.
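
As a rough sketch of what switching out an import can look like, the function below receives the array module as an argument. The name `xp` and the helper `normalise` are illustrative assumptions, not part of CuPy, and the code falls back to NumPy when no usable GPU is found so it still runs on a CPU-only machine.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU/driver is present
except Exception:
    cp = None  # fall back to NumPy only

def normalise(xp, n):
    """Build 0..n-1 and rescale it to [0, 1] with whichever array module `xp` is."""
    x = xp.arange(n, dtype=xp.float64)
    return (x - x.min()) / (x.max() - x.min())

cpu_result = normalise(np, 5)
# The identical function body runs on the GPU when handed the CuPy module
gpu_result = normalise(cp, 5) if cp is not None else None
```

Because the function only touches the API that NumPy and CuPy share, the choice of backend is made by the caller rather than inside the function.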

In [None]:
import numpy as np
import cupy as cp
cp.cuda.Stream.null.synchronize()

Let's walk through some simple examples from [this blog post](https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56).

## Creating arrays

First let's create ourselves a `2GB` array (1000 × 500 × 500 `float64` values at 8 bytes each) on both the CPU and the GPU and compare how long each takes.

In [None]:
%%timeit -r 1 -n 10
x_cpu = np.ones((1000,500,500))

In [None]:
%%timeit -r 1 -n 10
x_gpu = cp.ones((1000,500,500))

cp.cuda.Stream.null.synchronize()

_Note we need to call `cp.cuda.Stream.null.synchronize()` explicitly here for our timings to be fair. By default CuPy launches GPU work asynchronously, so a call can return before the GPU has finished. Calling `synchronize()` makes us wait for the GPU to finish before returning._
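
Outside a notebook, the same idea can be sketched with a small timing helper. The helper name `timed` and its `sync` parameter are assumptions made for this example, and the array is kept deliberately small so the sketch runs anywhere; the GPU branch is skipped when no usable GPU is found.

```python
import time
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU/driver is present
except Exception:
    cp = None

def timed(fn, sync=None):
    """Return (result, seconds); `sync` is called before the clock is stopped."""
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # block until the queued GPU work has finished
    return result, time.perf_counter() - start

shape = (100, 50, 50)  # small on purpose; the notebook uses (1000, 500, 500)
_, cpu_seconds = timed(lambda: np.ones(shape))
if cp is not None:
    _, gpu_seconds = timed(lambda: cp.ones(shape),
                           sync=cp.cuda.Stream.null.synchronize)
```

Without the `sync` call, the GPU measurement would mostly capture the time to queue the work rather than the time to complete it. Recent CuPy releases also provide `cupyx.profiler.benchmark`, which handles synchronization for you.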

We can see here that creating this array on the GPU is much faster than doing so on the CPU, but this time our code looks exactly the same. We haven't had to worry about kernels, threads, blocks or any of that stuff.
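
To get a feel for how much memory an array of this shape occupies, the size can be worked out without allocating anything. This is a small sketch that assumes the default `float64` dtype, which is what `np.ones` and `cp.ones` use when no `dtype` is given.

```python
import numpy as np

shape = (1000, 500, 500)
itemsize = np.dtype(np.float64).itemsize   # 8 bytes per element
n_bytes = int(np.prod(shape)) * itemsize   # 250,000,000 elements * 8 bytes
print(f"{n_bytes / 1e9:.1f} GB")           # prints "2.0 GB"
```

The same number is available on an existing array as `x_cpu.nbytes` or `x_gpu.nbytes`.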

## Basic operations

Next let's have a look at doing some math on our arrays. We can start by multiplying every value in our arrays by `5`.

In [None]:
%%time
x_cpu *= 5

In [None]:
%%time
x_gpu *= 5

cp.cuda.Stream.null.synchronize()

Again the GPU completes this much faster, but the code stays the same.

Now let's do a couple of operations sequentially, something which would've suffered from memory transfer times in our Numba examples without explicit memory management.

In [None]:
%%time
x_cpu *= 5
x_cpu *= x_cpu
x_cpu += x_cpu

In [None]:
%%time
x_gpu *= 5
x_gpu *= x_gpu
x_gpu += x_gpu

cp.cuda.Stream.null.synchronize()

Again we can see the GPU ran much faster, even without us explicitly managing memory. This is because the array stays in GPU memory between operations and CuPy handles all of this for us transparently.
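
The sketch below makes the transfers explicit on a tiny array, to show where data actually crosses between host and device. The alias `xp` is an assumption for this example; NumPy stands in when no usable GPU is found, in which case no transfer happens at all.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU/driver is present
except Exception:
    cp = None

xp = cp if cp is not None else np  # NumPy stands in when there is no GPU

x_host = np.ones((4, 4))
x_dev = xp.asarray(x_host)  # host -> device copy when xp is CuPy

# All three updates run where x_dev lives; nothing is copied back in between
x_dev *= 5       # 1 -> 5
x_dev *= x_dev   # 5 -> 25
x_dev += x_dev   # 25 -> 50

# device -> host copy; cp.asnumpy returns a numpy.ndarray
result = cp.asnumpy(x_dev) if cp is not None else x_dev
```

Only the first and last steps move data, which is why chaining operations on a CuPy array does not pay a transfer cost per operation.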

## More complex operations

Now that we've tried out some operators, let's dive into some NumPy functions. Let's compare running a singular value decomposition on a slightly smaller array of data.

In [None]:
%%time
x_cpu = np.random.random((1000, 1000))
u, s, v = np.linalg.svd(x_cpu)

In [None]:
%%time
x_gpu = cp.random.random((1000, 1000))
u, s, v = cp.linalg.svd(x_gpu)

cp.cuda.Stream.null.synchronize()

As we can see the GPU outperforms the CPU again with exactly the same API.

It is also interesting to note here that NumPy can dispatch function calls like this to other array libraries. In the above example we called `cp.linalg.svd`, but we could also call `np.linalg.svd` and pass it our GPU array. NumPy would detect that the argument is a CuPy array (via the `__array_function__` protocol) and hand the call to `cp.linalg.svd` on our behalf. This makes it even easier to introduce `cupy` into your code with minimal changes.

In [None]:
%%time
x_gpu = cp.random.random((1000, 1000))
u, s, v = np.linalg.svd(x_gpu)  # Note the `np` used here

cp.cuda.Stream.null.synchronize()
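
The mechanism behind this dispatch is NumPy's `__array_function__` protocol, which CuPy arrays implement. The sketch below shows the protocol in isolation using a made-up container class; `WrappedArray` is hypothetical and exists only for illustration, and CuPy's real implementation forwards to its own GPU functions rather than back to NumPy.

```python
import numpy as np

class WrappedArray:
    """Hypothetical container used only to illustrate the protocol."""
    calls = []  # names of the NumPy functions that were dispatched to us

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this instead of running its own implementation
        WrappedArray.calls.append(func.__name__)
        unwrapped = [a.data if isinstance(a, WrappedArray) else a for a in args]
        return func(*unwrapped, **kwargs)

x = WrappedArray([[2.0, 0.0], [0.0, 1.0]])
s = np.linalg.svd(x, compute_uv=False)  # routed through __array_function__
```

After this runs, `WrappedArray.calls` contains `"svd"`, showing that `np.linalg.svd` handed control to the argument's own type before doing any work.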

## Devices

CuPy has a concept of a current device, which is the default GPU on which the allocation, manipulation and calculation of arrays take place. Suppose the ID of the current device is `0`. In such a case, the following code would create an array `x_on_gpu0` on GPU 0.

In [None]:
with cp.cuda.Device(0):
    x_on_gpu0 = cp.random.random((100000, 1000))

x_on_gpu0.device

In general, CuPy functions expect that the array is on the same device as the current one. Passing an array stored on a non-current device may work depending on the hardware configuration but is generally discouraged as it may not be performant.
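
As a hedged sketch of working with more than one device, the loop below makes each visible GPU current in turn and allocates a small array on it. It assumes nothing about the machine: when CuPy or a usable GPU is missing, the device count is treated as zero and the GPU code is skipped.

```python
try:
    import cupy as cp
    n_devices = cp.cuda.runtime.getDeviceCount()
except Exception:
    cp, n_devices = None, 0  # no CuPy or no usable GPU: the code below is skipped

device_ids = []
for i in range(n_devices):
    with cp.cuda.Device(i):             # make GPU i the current device
        x = cp.zeros(3)                 # allocated on the current device
        device_ids.append(x.device.id)

if n_devices > 1:
    with cp.cuda.Device(0):
        a = cp.arange(3)
    with cp.cuda.Device(1):
        b = cp.asarray(a)  # copies `a` onto GPU 1, the current device here
```

Moving an array explicitly with `cp.asarray` inside the target device's context keeps later operations on a single device, which avoids the cross-device access discouraged above.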