# CuPy - NumPy for NVIDIA GPUs
### Overview

CuPy is part of the Chainer project but has maintainers from many organisations, including NVIDIA. CuPy implements the familiar NumPy API with a backend written in CUDA C++. This lets anyone already familiar with NumPy get GPU acceleration quickly, often by just switching out an import.
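As a minimal sketch of the "switch out an import" idea, the snippet below binds a module alias `xp` to CuPy when a GPU is usable and to NumPy otherwise. The alias name `xp`, the helper `normalize`, and the CPU fallback are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption: treat "no visible CUDA device" the same as "CuPy not installed".
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device visible")
    xp = cp
except Exception:
    xp = np  # CPU fallback so the sketch still runs without a GPU


def normalize(values):
    """Scale a vector to unit length with whichever backend `xp` points at."""
    arr = xp.asarray(values, dtype=xp.float64)
    return arr / xp.linalg.norm(arr)


unit = normalize([3.0, 4.0])
# Copy back to host memory only when the result lives on the GPU.
unit_host = cp.asnumpy(unit) if xp is not np else unit
```

The body of `normalize` never mentions the backend by name, which is what makes the import swap sufficient for code like this.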

* #### Open-source matrix library accelerated with NVIDIA CUDA



* #### CuPy uses CUDA-related libraries including:
    * cuBLAS
    * cuDNN
    * cuRAND
    * cuSOLVER
    * cuSPARSE
    * cuFFT
    * NCCL

![](graphics/cupy_stack.png)

* #### CuPy interface is highly compatible with NumPy
* #### Comparison of NumPy and CuPy APIs: https://docs-cupy.chainer.org/en/stable/reference/comparison.html
* #### Possibility to define CUDA kernels to optimize programs
   
### General Performance comparison
![](graphics/numpy_cupy.png)

# CUDA Array Interface

Because moving data between the CPU and the GPU is expensive, we want to keep as much data as possible on the GPU at all times.

Sometimes in our workflow we want to change which tool we are using too. Perhaps we load an array of data with `cupy` but we want to write a custom CUDA kernel with `numba`. Or perhaps we want to switch to using a Deep Learning framework like `pytorch`. 

When any of these libraries load data onto the GPU, the array in memory is pretty much the same. The differences between a cupy `ndarray` and a numba `DeviceNDArray` just boil down to how that array is wrapped and hooked into Python.

Thankfully, with utilities like [DLPack](https://github.com/dmlc/dlpack) and [`__cuda_array_interface__`](https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html) we can convert from one type to another without modifying the data on the GPU. We just create a new Python wrapper object and transfer all the device pointers across.

Ensuring compatibility between popular GPU Python libraries is one of the core goals of the RAPIDS community.

![](graphics/array-interface.png)

We start off by creating an array with cupy.
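The following is a minimal sketch of a zero-copy hand-off through DLPack. To keep it self-contained it uses the same library as producer and consumer, and it falls back to NumPy when no GPU is usable (both `cupy.from_dlpack` and `numpy.from_dlpack` are assumed to be available, which requires reasonably recent versions). In a real workflow the consumer would typically be another library, for example `torch.from_dlpack`.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption: fall back to NumPy when no CUDA device is visible.
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device visible")
    xp = cp
except Exception:
    xp = np

source = xp.arange(5, dtype=xp.float32)

# Build a second array object from the DLPack capsule; no data is copied.
view = xp.from_dlpack(source)

# Writing through the original is visible through the new wrapper,
# which shows that both objects point at the same buffer.
source[0] = 42.0
shares_memory = bool(view[0] == 42.0)
```

Because only the wrapper is new, the cost of the conversion does not grow with the size of the array.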

# Why CuPy?
* #### Need for NumPy compatible GPU array library
* #### Comparison with other libraries
![](graphics/cupy_comp.png)

# Future of CuPy
* #### Support GPU in Python with minimal adjustments
* #### High compatibility with other libraries
* #### Covering not only NumPy but also SciPy
* #### Enabling GPU acceleration with minimal effort

In [None]:
# Install the CuPy wheel that matches the local CUDA toolkit,
# e.g. cupy-cuda100 targets CUDA 10.0
# pip install cupy-cuda100

### CuPy ndarray
* GPU alternative to numpy.ndarray
* content allocated on the device memory

In [1]:
import cupy as cp
import numpy as np

In [2]:
!nvidia-smi

Fri Nov 11 07:46:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA Quadro R...  On   | 00000000:00:05.0  On |                  N/A |
| 30%   33C    P8    14W / 125W |    387MiB /  7982MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
x_gpu=cp.array([1,2,3])

In [4]:
x_gpu

array([1, 2, 3])

In [5]:
!nvidia-smi

Fri Nov 11 07:47:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA Quadro R...  On   | 00000000:00:05.0  On |                  N/A |
| 30%   35C    P0    39W / 125W |    512MiB /  7982MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [6]:
cp.cuda.Device(0).mem_info

(7832731648, 8370061312)
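`mem_info` returns a `(free, total)` tuple in bytes. The sketch below converts the values printed above into more readable units; the numbers are copied from that output so the snippet runs without a GPU, and the variable names are illustrative.

```python
# Values copied from the cell output above; on a live system they would come
# from cp.cuda.Device(0).mem_info.
free_bytes, total_bytes = 7832731648, 8370061312

MIB = 1024 ** 2
used_bytes = total_bytes - free_bytes

used_mib = used_bytes / MIB
total_mib = total_bytes / MIB
print(f"used {used_mib:.0f} MiB of {total_mib:.0f} MiB")
```

The result lines up with the `512MiB / 7982MiB` figure that `nvidia-smi` reported after the first array was created.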

In [None]:
cp.cuda.Device(0).use()

In [9]:
x_gpu.device

<CUDA Device 0>

In [10]:
x_cpu=np.array([1,2,3])
x_gpu=cp.asarray(x_cpu)

In [11]:
cp.asnumpy(x_gpu)

array([1, 2, 3])

### Current device
* default device on which the calculation takes place

### Transferring data to and from GPU
* `cupy.asarray()` moves any object that can be passed to `numpy.array()` to the current device
* equivalent to `cupy.array(arr, dtype, copy=False)`
* `cupy.asnumpy()` copies a device array back to host memory
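The `copy=False` behaviour can be sketched as follows: calling `asarray` on an array that already lives on the target device reuses its buffer, while `array` makes an independent copy by default. The `xp` alias and the NumPy fallback are assumptions added so the snippet also runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    # Assumption: fall back to NumPy when no CUDA device is visible.
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device visible")
    xp = cp
except Exception:
    xp = np

host = np.array([1, 2, 3])
device = xp.asarray(host)    # host -> current device (a real copy when xp is CuPy)

alias = xp.asarray(device)   # already on the device: the buffer is reused
clone = xp.array(device)     # copy=True by default: a separate buffer

device[0] = 100
alias_first = int(alias[0])  # follows the change made through `device`
clone_first = int(clone[0])  # keeps the original value
```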

# Benchmark CuPy vs NumPy

In [12]:
import numpy as np
import cupy as cp
cp.cuda.Stream.null.synchronize()

Let's walk through some simple examples from this blog post https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56

## Creating arrays


In [13]:
%%time
x_cpu=np.ones((1000,500,500))

CPU times: user 86.7 ms, sys: 808 ms, total: 895 ms
Wall time: 897 ms


In [31]:
%%time
x_gpu=cp.ones((1000,500,500))
cp.cuda.Stream.null.synchronize()

CPU times: user 334 ms, sys: 43 ms, total: 377 ms
Wall time: 407 ms
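GPU kernels are launched asynchronously, so a timer that stops right after the call can miss most of the work; that is why the GPU cells call `cp.cuda.Stream.null.synchronize()`. The helper below is a minimal sketch of that pattern. The name `timed`, the `on_gpu` flag, and the NumPy fallback are assumptions for illustration.

```python
import time

import numpy as np

try:
    import cupy as cp
    # Assumption: fall back to NumPy when no CUDA device is visible.
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device visible")
    xp, on_gpu = cp, True
except Exception:
    xp, on_gpu = np, False


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, seconds), waiting for the GPU if one is used."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if on_gpu:
        # Block until the default stream has finished so the clock
        # covers the actual computation, not just the kernel launch.
        cp.cuda.Stream.null.synchronize()
    return result, time.perf_counter() - start


ones, seconds = timed(xp.ones, (100, 50, 50))
```

For repeated measurements, CuPy also ships `cupyx.profiler.benchmark`, which handles synchronization and warm-up runs.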


## Basic operations

Next let's have a look at doing some math on our arrays. We can start by multiplying every value in our arrays by `5`.

In [32]:
%%time
x_cpu*=5

CPU times: user 259 ms, sys: 0 ns, total: 259 ms
Wall time: 263 ms


In [39]:
%%time
# %%timeit wraps the cell in a function, where `x_gpu*=5` raises UnboundLocalError;
# %%time runs in the notebook namespace and matches the CPU cell above.
x_gpu*=5
cp.cuda.Stream.null.synchronize()
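The `UnboundLocalError` that `%%timeit` raises for an in-place update such as `x_gpu*=5` comes from Python scoping: `%%timeit` compiles the cell body into a function, and any assignment inside a function makes the name local. The sketch below reproduces the effect with plain NumPy and shows one way around it, writing into the existing array through `out=`. The function names are illustrative.

```python
import numpy as np

x = np.ones(3)


def augmented():
    # The augmented assignment makes `x` a local name, so reading it
    # before it has a local value raises UnboundLocalError.
    x *= 5


def with_out():
    # No assignment to the name `x`: the global array is updated in place.
    np.multiply(x, 5, out=x)


try:
    augmented()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

with_out()
```

`cp.multiply` accepts the same `out=` argument, so the workaround carries over to the GPU array.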

## More complex operations

Now that we've tried out some operators let's dive into some numpy functions. Let's compare running a singular value decomposition on a slightly smaller array of data.

In [42]:
%%time
x_cpu=np.random.random((1000,1000))
u,s,v=np.linalg.svd(x_cpu)

CPU times: user 2.77 s, sys: 1.61 s, total: 4.39 s
Wall time: 608 ms


In [43]:
%%time
x_gpu=cp.random.random((1000,1000))
u,s,v=cp.linalg.svd(x_gpu)

CPU times: user 513 ms, sys: 187 ms, total: 700 ms
Wall time: 700 ms

# Exercise

In [3]:
import cupy as cp

**1. Create the input data array with the numbers `1` to `500_000_000`.** 

In [8]:
# cp.arange excludes the stop value, so this holds 1..499_999_999
gpu_data=cp.arange(1,500_000_000,1)

In [9]:
!nvidia-smi

Fri Nov 11 07:50:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA Quadro R...  On   | 00000000:00:05.0  On |                  N/A |
| 30%   34C    P0    38W / 125W |   4312MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [7]:
del gpu_data

In [8]:
!nvidia-smi

Thu Nov 10 15:22:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA Quadro R...  On   | 00000000:00:05.0  On |                  N/A |
| 30%   38C    P8     5W / 125W |   4270MiB /  7982MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
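# Assuming 20377 is the PID holding the GPU memory: `kill -20377` would be
# parsed as a signal number, so pass the PID as the argument instead.
!sudo kill 20377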

**2. Calculate how large the array is in GB with `nbytes`** _Hint: GB is `1e9`_

In [6]:
gpu_data.nbytes

3999999992
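The conversion the hint asks for is plain arithmetic. The sketch below reproduces the `nbytes` value from the element count and converts it to GB (`1e9` bytes) and, for comparison, GiB (`2**30` bytes). It assumes the default integer dtype is 8-byte `int64`, which is consistent with the value shown above.

```python
n_elements = 499_999_999   # length of cp.arange(1, 500_000_000)
itemsize = 8               # bytes per element, assuming int64

nbytes = n_elements * itemsize
size_gb = nbytes / 1e9       # decimal gigabytes, as in the hint
size_gib = nbytes / 2 ** 30  # binary gibibytes, the unit behind nvidia-smi's MiB

print(f"{nbytes} bytes = {size_gb:.2f} GB = {size_gib:.2f} GiB")
```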

In [10]:
memorypool=cp.get_default_memory_pool()

In [11]:
memorypool.used_bytes()

0
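Deleting a CuPy array returns its memory to CuPy's memory pool rather than to the driver, which is why `nvidia-smi` can keep reporting the memory as in use while `used_bytes()` is `0`. A minimal sketch of handing the cached blocks back is shown below; the helper name `release_cached_gpu_memory` and the "return `None` without a GPU" behaviour are assumptions for illustration.

```python
def release_cached_gpu_memory():
    """Free unused pool blocks; return (used, cached) bytes, or None without a GPU."""
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() == 0:
            return None
        pool = cp.get_default_memory_pool()
        # Blocks that no live array references are returned to the driver.
        pool.free_all_blocks()
        return pool.used_bytes(), pool.total_bytes()
    except Exception:
        # Assumption: any import or CUDA error means "no usable GPU" here.
        return None


stats = release_cached_gpu_memory()
```

After this call, the memory that `nvidia-smi` attributes to the process should drop without having to kill it, apart from the fixed overhead of the CUDA context.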

**3. How many dimensions does the array have?**

In [11]:
gpu_data.ndim

1

**4. How many elements does the array have?**

In [None]:
gpu_data.size

**5. What is the shape of the array?**

In [13]:
gpu_data.shape

(499999999,)

**6. Create a new array with `5_000_000` elements representing the linear space of `0` to `1000`.**

In [16]:
?cp.linspace

In [14]:
data=cp.linspace(0,1000,5_000_000)

In [15]:
data[:20]

array([0.    , 0.0002, 0.0004, 0.0006, 0.0008, 0.001 , 0.0012, 0.0014,
       0.0016, 0.0018, 0.002 , 0.0022, 0.0024, 0.0026, 0.0028, 0.003 ,
       0.0032, 0.0034, 0.0036, 0.0038])

**7. Create a random array that is `10_000` by `5_000`.**

In [17]:
data=cp.random.normal(size=(10_000,5_000))

In [18]:
data

array([[ 0.1579573 ,  3.11995764,  0.82504753, ..., -0.90379534,
         0.61172377,  0.3897284 ],
       [-0.70753572, -0.44547336,  0.25239969, ..., -0.67596071,
         0.03140169,  0.23306418],
       [ 1.90213577, -0.23399382,  0.7387471 , ..., -1.80823866,
         0.51924914, -0.88093739],
       ...,
       [ 0.57333766,  0.70352475, -1.44767314, ..., -0.9647648 ,
        -0.31165051,  2.08798352],
       [-0.0380893 , -0.12240508, -1.34853838, ..., -0.48717084,
         0.61094759, -0.02036162],
       [-0.00628658, -0.24089984, -0.95427279, ..., -1.38525948,
         0.13773524, -0.36072051]])

**8. Sort that array.**

In [19]:
cp.sort(data)

array([[-3.39107571, -3.29521416, -3.13987372, ...,  3.11995764,
         3.34230042,  3.42442149],
       [-3.24115845, -3.18329232, -3.16890568, ...,  3.45503125,
         3.47908322,  3.66857942],
       [-3.66182806, -3.56602301, -3.29821417, ...,  3.27145597,
         3.29765288,  3.36315015],
       ...,
       [-3.48921368, -3.26887593, -3.26687119, ...,  3.31283098,
         3.46444187,  3.4709336 ],
       [-3.58337798, -3.43428127, -3.40393759, ...,  3.28363236,
         3.30588332,  3.58918126],
       [-3.04898224, -3.03400481, -3.02434857, ...,  3.34298772,
         3.38221875,  3.59776512]])
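Like `np.sort`, `cp.sort` orders values along the last axis by default, which is why each row in the output above is sorted independently. The small sketch below contrasts that with sorting along the first axis and with sorting every element. It uses NumPy so it runs without a GPU, on the assumption that the same calls are made with `cp` on the device.

```python
import numpy as np

sample = np.array([[3, 9, 2],
                   [1, 7, 8]])

row_sorted = np.sort(sample)          # default axis=-1: each row on its own
col_sorted = np.sort(sample, axis=0)  # each column on its own
all_sorted = np.sort(sample.ravel())  # flatten first to order every element
```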

**Extra: Reshape the array to have one dimension of length `5`.**

In [22]:
data=data.reshape(10_000,1000,5)

In [23]:
data.shape

(10000, 1000, 5)
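Rather than working out the middle dimension by hand, one dimension can be given as `-1` and `reshape` infers it from the total number of elements. The sketch uses a small NumPy array so it runs anywhere; `reshape` on a CuPy array accepts the same placeholder.

```python
import numpy as np

data = np.arange(60).reshape(6, 10)

# Keep the first dimension, force the last one to 5, infer the middle one (10 / 5 = 2).
reshaped = data.reshape(6, -1, 5)
```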

In [27]:
data=cp.random.uniform(size=100).reshape(10,10)

In [34]:
cp.matmul(data,cp.linalg.inv(data))

array([[ 1.00000000e+00, -4.44089210e-16,  0.00000000e+00,
         2.22044605e-16, -4.44089210e-16, -1.66533454e-16,
         2.22044605e-16,  6.66133815e-16, -1.38777878e-16,
         1.11022302e-16],
       [-7.21644966e-16,  1.00000000e+00, -1.76941795e-16,
         2.22044605e-16, -3.88578059e-16, -2.22044605e-16,
         1.66533454e-16,  3.33066907e-16, -1.38777878e-16,
         4.16333634e-16],
       [-5.55111512e-16, -5.55111512e-17,  1.00000000e+00,
         2.22044605e-16, -4.44089210e-16, -1.11022302e-16,
         1.11022302e-16,  3.88578059e-16,  0.00000000e+00,
         1.74339709e-16],
       [-9.43689571e-16,  0.00000000e+00, -2.77555756e-16,
         1.00000000e+00, -5.55111512e-16, -2.22044605e-16,
         1.11022302e-16,  6.10622664e-16, -1.11022302e-16,
         3.88578059e-16],
       [-7.77156117e-16, -2.22044605e-16, -4.02455846e-16,
         2.22044605e-16,  1.00000000e+00, -2.22044605e-16,
         2.22044605e-16,  3.88578059e-16,  5.55111512e-17,
         2.