# The <code>cupy</code> Package

Multiprocessing is a topic in this course, and the intended focus was multiprocessing on the CPU.  Given the recent advances in the <code>cupy</code> package, a quick tangent on parallel computing with Graphics Processing Units (GPUs) is worthwhile, because some computers are equipped with powerful graphics cards that carry thousands of onboard processors.  Although each of these processors is much less powerful than a CPU core, their power lies in their number.  Here are the specs of my NVIDIA GeForce RTX 3090 graphics card:

- 10496 computing cores
- 24GB memory

There is strength in the number of processors.  The GPU memory capacity, 24GB, is less than the 64GB available to my CPU, so memory is a relative weakness.

NVIDIA's <code>cuda</code> is used by the <code>cupy</code> package, so only CUDA-enabled NVIDIA graphics cards can run <code>cupy</code>.  A list of those can be found [here](https://developer.nvidia.com/cuda-gpus).  If your graphics card is listed there, then you can run <code>cupy</code>.

The programming interface for <code>cupy</code> currently looks just like the <code>numpy</code> interface.  To use <code>cupy</code>, use this import statement,

``` python
import cupy as cp
```

and then use the <code>cp</code> alias as you would <code>np</code>.

For example, see some of the commands below.

# Installing <code>cupy</code>

I recommend creating a new anaconda environment for <code>cupy</code> by following these steps:

- Open an Anaconda prompt as administrator
- Execute the command <code>conda create --name __env_name__ cupy anaconda</code> where __env_name__ is the name you want to give this new environment.  Since its main feature is <code>cupy</code>, that is what I would call it.  The <code>cupy</code> argument causes <code>cupy</code> to be installed, and the <code>anaconda</code> argument causes hundreds of the default Anaconda packages to be downloaded.
- Follow the default prompts to download and install all the packages.

You will also need to install <code>cuda</code> on your computer.  See this [link](https://developer.nvidia.com/cuda-downloads).
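Once both pieces are installed, a quick check like the one below can confirm that <code>cupy</code> actually sees a GPU.  This is only a sketch: the helper name <code>describe_gpu</code> is made up for illustration, and it assumes that <code>cp.cuda.runtime.getDeviceProperties</code> returns a dictionary with the keys <code>name</code>, <code>totalGlobalMem</code> and <code>multiProcessorCount</code>, which may differ between <code>cupy</code> versions.

```python
def describe_gpu():
    """Return a short description of the first CUDA device cupy can see, or None.

    Hypothetical helper for illustration; not part of cupy itself.
    """
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() == 0:
            return None
        props = cp.cuda.runtime.getDeviceProperties(0)
    except Exception:
        # cupy is not installed, or the CUDA driver/runtime is not usable.
        return None
    name = props["name"]  # assumed key; may be returned as bytes
    if isinstance(name, bytes):
        name = name.decode()
    mem_gib = props["totalGlobalMem"] / 1024**3
    return f"{name}: {mem_gib:.1f} GiB, {props['multiProcessorCount']} multiprocessors"


info = describe_gpu()
print(info if info is not None else "No CUDA device visible to cupy")
```

Note that the multiprocessor count reported here is not the same as the number of computing cores quoted for a card, since each multiprocessor contains many cores.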

# Let's Get to Work!

In [3]:
import cupy as cp

In [4]:
cp.arange(4).reshape(2,2)

array([[0, 1],
       [2, 3]])

In [5]:
cp.zeros((3,3)).astype(cp.float32)

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32)

In [7]:
cp.ones((3,3)).astype(cp.float16)

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]], dtype=float16)
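The arrays created above live in GPU memory, not in the CPU's memory, so data sometimes has to be copied between the two.  The sketch below shows one way to do that with <code>cp.asarray</code> and <code>cp.asnumpy</code>.  The <code>HAVE_GPU</code> flag and the plain <code>numpy</code> fallback are my own additions so that the example still runs on a machine without a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    # cupy missing or no usable CUDA device (assumption: fall back to numpy).
    HAVE_GPU = False

host = np.arange(4).reshape(2, 2)    # ordinary numpy array in CPU memory

if HAVE_GPU:
    device = cp.asarray(host)        # copy CPU memory -> GPU memory
    doubled = device * 2             # computed on the GPU
    back = cp.asnumpy(doubled)       # copy GPU memory -> CPU memory
else:
    back = host * 2                  # same arithmetic on the CPU

print(type(back).__name__, back.tolist())
```

These copies are not free, so it usually pays to move data to the GPU once, do as much work there as possible, and bring back only the final result.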

The advantage of <code>cupy</code> is that it uses the many GPU processors.  The array data are structured so that a computation can be split among the GPU processors and its portions carried out in parallel.

In [8]:
import time
import numpy as np

In [24]:
size=10000

In [25]:
''' Load numpy variables '''
p = np.load(f'data/mnist1_{size}.npy').astype(np.float64)
q = np.load(f'data/mnist2_{size}.npy').astype(np.float64)
assert p.shape[0] == q.shape[0]
assert p.shape[1] == 784
assert q.shape[1] == 784

n = p.shape[0]
pixels = p.shape[1]

In [26]:
start = time.time()

result_ein = np.sqrt(np.einsum('ij,ij->i',q,q)[:,np.newaxis] - 2*q@p.T + np.einsum('ij,ij->i',p,p))

print(f'Exec. time: {time.time() - start} for {n}x{n}')
print(result_ein[0,:5])

Exec. time: 2.1572675704956055 for 10000x10000
[ 9.34416908 10.68585075  9.54365603  8.2039741   8.32884035]
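The <code>einsum</code> expression above relies on expanding the squared Euclidean distance as q·q - 2 q·p + p·p, so that the bulk of the work becomes a single matrix product.  As a sanity check, the sketch below compares that formula with a direct computation on small random arrays.  The arrays <code>q_small</code> and <code>p_small</code> are made-up stand-ins, not the MNIST data, and the clipping step is my own precaution.

```python
import numpy as np

rng = np.random.default_rng(0)
q_small = rng.random((5, 784))   # hypothetical stand-in for q
p_small = rng.random((4, 784))   # hypothetical stand-in for p

qq = np.einsum('ij,ij->i', q_small, q_small)[:, np.newaxis]  # row norms of q, shape (5, 1)
pp = np.einsum('ij,ij->i', p_small, p_small)                 # row norms of p, shape (4,)
squared = qq - 2 * q_small @ p_small.T + pp                  # broadcasts to shape (5, 4)

# Rounding can push a tiny squared distance slightly below zero; clip before sqrt.
fast = np.sqrt(np.maximum(squared, 0.0))

# Direct computation: fast[i, j] should be the distance from q_small[i] to p_small[j].
slow = np.array([[np.linalg.norm(qi - pj) for pj in p_small] for qi in q_small])

print(np.allclose(fast, slow))
```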


In [27]:
''' Load cupy variables '''

p = cp.load(f'data/mnist1_{size}.npy').astype(cp.float64)
q = cp.load(f'data/mnist2_{size}.npy').astype(cp.float64)
assert p.shape[0] == q.shape[0]
assert p.shape[1] == 784
assert q.shape[1] == 784

n = p.shape[0]
pixels = p.shape[1]

In [28]:
start = time.time()
# Same expression as in the numpy cell, written with the cp namespace so that
# it is explicit that the work runs on the GPU.
# Element-wise reference (slow, for checking only):
# for i in range(n):
#    for j in range(n):
#        result[i][j] = cp.sqrt(q[i]@q[i]-2*p[j]@q[i]+p[j]@p[j])

result_ein = cp.sqrt(cp.einsum('ij,ij->i',q,q)[:,cp.newaxis] - 2*q@p.T + cp.einsum('ij,ij->i',p,p))

print(f'Exec. time: {time.time() - start} for {n}x{n}')
print(result_ein[0,:5])

Exec. time: 0.07599401473999023 for 10000x10000
[ 9.34416908 10.68585075  9.54365603  8.2039741   8.32884035]
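One caution when reading GPU timings like the one above: <code>cupy</code> launches GPU work asynchronously, so a wall-clock measurement taken right after the call may not include all of the computation.  The sketch below shows one way to time more carefully, by synchronizing the default stream before reading the clock and by doing a warm-up call first.  The helpers <code>timed</code> and <code>pairwise</code>, the <code>HAVE_GPU</code> flag and the small test arrays are my own illustration, with a <code>numpy</code> fallback so the code runs without a GPU.

```python
import time
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    HAVE_GPU = False

xp = cp if HAVE_GPU else np   # array module used below (assumption: numpy fallback)


def timed(fn, *args):
    """Run fn(*args) and return (result, seconds), waiting for the GPU to finish."""
    if HAVE_GPU:
        cp.cuda.Stream.null.synchronize()   # finish any previously queued work
    start = time.perf_counter()
    result = fn(*args)
    if HAVE_GPU:
        cp.cuda.Stream.null.synchronize()   # wait for the work launched by fn
    return result, time.perf_counter() - start


def pairwise(q, p):
    """Pairwise Euclidean distances between the rows of q and the rows of p."""
    qq = xp.einsum('ij,ij->i', q, q)[:, None]
    pp = xp.einsum('ij,ij->i', p, p)
    return xp.sqrt(xp.maximum(qq - 2 * q @ p.T + pp, 0))


a = xp.ones((200, 784))
b = xp.zeros((200, 784))
timed(pairwise, a, b)                       # warm-up call, result discarded
result, seconds = timed(pairwise, a, b)
print(f'Exec. time: {seconds} for {a.shape[0]}x{b.shape[0]}')
```

The first call on a GPU often includes one-time setup costs, which is why the warm-up call is discarded before the measured run.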


Memory problems are still possible with <code>cupy</code>, and perhaps more likely given the smaller GPU memory, so programming in a manner that minimizes the memory footprint is important.

First, use lower-precision data types.  Using <code>cp.float16</code> seems to have no adverse effects in <code>cupy</code>, whereas in <code>numpy</code> the same approach is actually slower than using <code>np.float32</code>.
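To see what lower precision buys, the sketch below computes how much memory a single 10000x10000 result matrix needs for each data type.  The helper <code>result_gib</code> is made up for illustration.  It uses <code>numpy</code> dtypes, which have the same item sizes as the corresponding <code>cupy</code> dtypes.

```python
import numpy as np


def result_gib(n, dtype):
    """Memory in GiB needed for an n x n array of the given dtype (hypothetical helper)."""
    return n * n * np.dtype(dtype).itemsize / 1024**3


n = 10000
for dtype in (np.float64, np.float32, np.float16):
    print(f'{np.dtype(dtype).name}: {result_gib(n, dtype):.2f} GiB for a {n}x{n} result')
```

Each halving of the precision halves the footprint.  Keep in mind that <code>float16</code> carries only about three significant decimal digits, so it is worth comparing a few results against a higher-precision run before relying on it.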

There are multiple ways to break computations into chunks in order to stay within memory constraints.
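One such way, sketched below, is to process the rows of <code>q</code> in blocks, so that only a block of the distance matrix exists at any time.  The generator <code>pairwise_chunked</code>, its <code>chunk</code> and <code>xp</code> parameters, and the small random arrays are my own illustration rather than a prescribed method.  The example runs with <code>numpy</code>; passing <code>xp=cp</code> together with <code>cupy</code> arrays should run the same code on the GPU.

```python
import numpy as np


def pairwise_chunked(q, p, chunk=1000, xp=np):
    """Yield (row_start, block), where block holds the distances from
    q[row_start:row_start + chunk] to every row of p.

    Hypothetical helper: xp is the array module (numpy here, cupy on a GPU).
    """
    pp = xp.einsum('ij,ij->i', p, p)            # computed once, reused per block
    for start in range(0, q.shape[0], chunk):
        qc = q[start:start + chunk]
        qq = xp.einsum('ij,ij->i', qc, qc)[:, None]
        squared = qq - 2 * qc @ p.T + pp
        yield start, xp.sqrt(xp.maximum(squared, 0))


# Example: index of the nearest row of p for each row of q, found block by
# block without ever holding the full distance matrix.
rng = np.random.default_rng(0)
q_demo = rng.random((25, 784))   # made-up data, not the MNIST arrays
p_demo = rng.random((10, 784))

nearest = np.empty(q_demo.shape[0], dtype=np.int64)
for start, block in pairwise_chunked(q_demo, p_demo, chunk=8):
    # With cupy arrays, copy the per-block result to the host with cp.asnumpy first.
    nearest[start:start + block.shape[0]] = block.argmin(axis=1)

print(nearest[:5])
```

The block size trades memory for the number of GPU calls: larger blocks mean fewer, bigger matrix products, and smaller blocks keep the peak memory use down.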