# The <code>cupy</code> Package

Multiprocessing in this course was intended to focus on the CPU.  Given recent advances in the <code>cupy</code> package, though, a quick tangent on multiprocessing with Graphics Processing Units (GPUs) is worthwhile, because some computers are equipped with powerful graphics cards that carry a multitude of onboard processors.  Each of these processors is much less powerful than a CPU core; their power lies in their number.  Here are the specs of my NVIDIA GeForce RTX 3090 graphics card:

- 10496 computing cores
- 24GB memory

The strength is the number of processors.  The GPU's 24GB of memory is less than the 64GB available to my CPU, which is a relative weakness.

NVIDIA's <code>cuda</code> is used by the <code>cupy</code> package, so you need a CUDA-enabled NVIDIA graphics card to run <code>cupy</code> as described here.  A list of those cards can be found [here](https://developer.nvidia.com/cuda-gpus).  If your graphics card is on that list, you can run <code>cupy</code>.

The advantage of <code>cupy</code> is that it uses the many GPU processors: array data are structured so that a computation can be distributed among the GPU processors and portions of it carried out in parallel.

The programming interface for <code>cupy</code> currently looks just like the <code>numpy</code> interface.  To use <code>cupy</code>, start with this import statement,

``` python
import cupy as cp
```

and then use the <code>cp</code> alias as you would <code>np</code>.

A few example <code>cupy</code> commands follow the installation instructions below.
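
Because the two interfaces match so closely, the same code can often run on either library.  The sketch below is only an illustration under my own assumptions: the names <code>xp</code> and <code>to_host</code> are hypothetical, and the fallback to <code>numpy</code> when no usable GPU is found is my choice, not something <code>cupy</code> does for you.

```python
import numpy as np

# Assumption: fall back to numpy when cupy or a usable CUDA device is missing.
try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:   # raises or returns 0 without a GPU
        raise RuntimeError('no CUDA device found')
    xp = cp
except Exception:
    cp = None
    xp = np

def to_host(a):
    """Return a numpy array whether `a` lives on the GPU or on the CPU."""
    return cp.asnumpy(a) if cp is not None else np.asarray(a)

x = xp.arange(4).reshape(2, 2)   # the same call works in either library
y = x @ x                        # matrix product, on the GPU when xp is cupy
print(xp.__name__, to_host(y).tolist())
```

Results that live on the GPU have to be copied back to the host (here with <code>cp.asnumpy</code>) before ordinary <code>numpy</code> code can use them.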

# Installing <code>cupy</code>

I recommend creating a new Anaconda environment for <code>cupy</code> by following these steps:

- Open an Anaconda prompt as administrator
- Execute the command <code>conda create --name __env_name__ cupy anaconda</code>, where __env_name__ is the name you want to give this new environment.  Since its main feature is <code>cupy</code>, that is what I would call it.  The <code>cupy</code> argument causes <code>cupy</code> to be installed, and the <code>anaconda</code> argument causes hundreds of the default Anaconda packages to be downloaded as well.
- Follow the default prompts to download and install all the packages.

You will also need to install <code>cuda</code> on your computer.  See this [link](https://developer.nvidia.com/cuda-downloads).

In [None]:
import cupy as cp

In [None]:
cp.arange(4).reshape(2,2)

In [None]:
cp.zeros((3,3), dtype=cp.float32)

In [None]:
cp.ones((3,3), dtype=cp.float16)

# Let's Get to Work!

Let's compare the speed of <code>cupy</code> with <code>numpy</code> on the MNIST distance computation.
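
The distance cells below appear to rely on expanding the squared Euclidean distance between row $q_i$ of <code>q</code> and row $p_j$ of <code>p</code> (the symbols $q_i$ and $p_j$ are my notation, not the notebook's):

$$\lVert q_i - p_j \rVert^2 = q_i \cdot q_i - 2\, q_i \cdot p_j + p_j \cdot p_j$$

In the code, <code>einsum('ij,ij->i', q, q)</code> supplies the $q_i \cdot q_i$ terms as a column, <code>q@p.T</code> supplies every $q_i \cdot p_j$ at once, and <code>einsum('ij,ij->i', p, p)</code> supplies the $p_j \cdot p_j$ terms as a row.  Broadcasting combines them into an $n \times n$ matrix whose element-wise square root holds the distances.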

In [8]:
import time
import numpy as np

In [24]:
size=10000

In [33]:
# Load numpy variables
p = np.load(f'data/mnist1_{size}.npy').astype(np.float32)
q = np.load(f'data/mnist2_{size}.npy').astype(np.float32)
assert p.shape[0] == q.shape[0]
assert p.shape[1] == 784
assert q.shape[1] == 784

n = p.shape[0]
pixels = p.shape[1]

In [34]:
start = time.time()

result_ein = np.sqrt(np.einsum('ij,ij->i',q,q)[:,np.newaxis] - 2*q@p.T + np.einsum('ij,ij->i',p,p))

print(f'Exec. time: {time.time() - start} for {n}x{n}')
print(result_ein[0,:5])

Exec. time: 1.0540456771850586 for 10000x10000
[ 9.344169 10.68585   9.543656  8.203975  8.32884 ]


In [35]:
# Load cupy variables

p = cp.load(f'data/mnist1_{size}.npy').astype(cp.float16)
q = cp.load(f'data/mnist2_{size}.npy').astype(cp.float16)
assert p.shape[0] == q.shape[0]
assert p.shape[1] == 784
assert q.shape[1] == 784

n = p.shape[0]
pixels = p.shape[1]

In [36]:
start = time.time()

result_ein = cp.sqrt(cp.einsum('ij,ij->i',q,q)[:,cp.newaxis] - 2*q@p.T + cp.einsum('ij,ij->i',p,p))

print(f'Exec. time: {time.time() - start} for {n}x{n}')
print(result_ein[0,:5])

Exec. time: 0.041013240814208984 for 10000x10000
[ 9.34 10.69  9.54  8.2   8.33]
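
One caveat on the timing above: <code>cupy</code> queues GPU work asynchronously, so a wall-clock reading taken right after the expression may stop before the GPU has actually finished, and the GPU time may be understated.  The sketch below shows one way to guard against that.  The helper <code>timed</code> and the array sizes are hypothetical, and the data are <code>numpy</code> arrays so that it runs without a GPU.

```python
import time
import numpy as np

def timed(fn, sync=None):
    """Run fn() and return (result, seconds); call sync() before stopping the clock."""
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()   # wait for any queued device work to finish
    return result, time.perf_counter() - start

# CPU stand-in data so the sketch runs anywhere (hypothetical sizes).
rng = np.random.default_rng(0)
a = rng.random((200, 784), dtype=np.float32)
b = rng.random((300, 784), dtype=np.float32)

prod, seconds = timed(lambda: a @ b.T)
print(prod.shape, f'{seconds:.6f} s')

# With cupy arrays on a CUDA GPU, the call would look like this:
#   prod, seconds = timed(lambda: a_gpu @ b_gpu.T,
#                         sync=cp.cuda.Stream.null.synchronize)
```

The first <code>cupy</code> call of a given kind can also include one-time setup cost, so repeating the measurement a few times gives a fairer comparison.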


Memory problems are still possible with <code>cupy</code>, and perhaps more likely than with <code>numpy</code> because the GPU usually has less memory than the CPU, so programming in a manner that minimizes the memory footprint is important.

First, use lower-precision data types.  Apart from the reduced precision visible in the output above, <code>cp.float16</code> seems to have no adverse effects in <code>cupy</code>, whereas in <code>numpy</code> using <code>np.float16</code> is actually slower than using <code>np.float32</code>.
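
To put numbers on the savings from lower precision, here is a minimal sketch that computes array sizes for the shapes used in the cells above.  The helper name <code>megabytes</code> is hypothetical; the arithmetic is just elements times bytes per element.

```python
import numpy as np

def megabytes(shape, dtype):
    """Size in MB (10**6 bytes) of an array with this shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize / 1e6

rows, pixels = 10000, 784   # sizes used in the cells above

for dtype in (np.float64, np.float32, np.float16):
    inputs = 2 * megabytes((rows, pixels), dtype)   # p and q together
    result = megabytes((rows, rows), dtype)         # n x n distance matrix
    print(f'{np.dtype(dtype).name}: inputs {inputs:.1f} MB, result {result:.1f} MB')
```

The $n \times n$ result dominates: 400 MB in <code>float32</code> against 200 MB in <code>float16</code>, before counting temporaries such as <code>2*q@p.T</code>, which are the same size.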

There are multiple ways to break computations into chunks to stay within memory constraints.

- Use <code>np.split()</code>/<code>cp.split()</code>
- Break computations with multiple operands into smaller computations and carefully delete variables when they are no longer needed

In [37]:
x = np.arange(36).reshape(9,4)
x

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31],
       [32, 33, 34, 35]])

In [38]:
np.split(x, 3)

[array([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]]),
 array([[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]),
 array([[24, 25, 26, 27],
        [28, 29, 30, 31],
        [32, 33, 34, 35]])]

The next example shows how to break the evaluation of a quadratic matrix function of this form,

$x^T Q x + c^T x$,

into smaller chunks.

In [40]:
x = np.random.random(20)
Q = np.random.random((20,20))
c = np.arange(20)
x.shape, Q.shape, c.shape

((20,), (20, 20), (20,))

Evaluating the entire expression at once:

In [41]:
x@Q@x + c@x

134.95202466724632

The same computation in smaller steps, deleting each operand as soon as it is no longer needed:

In [42]:
result = x@Q
del Q
result = result@x
result += c@x
del c, x
result

134.95202466724632
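
When <code>Q</code> itself is too large to hold at once, the quadratic term can also be accumulated over blocks of rows, since $x^T Q x$ is the sum of <code>x[i:j] @ (Q[i:j] @ x)</code> over the row blocks.  The sketch below is one way to do that.  The function name and block size are hypothetical, and <code>Q</code> is kept in memory here only so the two results can be compared.

```python
import numpy as np

def quadratic_in_blocks(x, Q, c, block_rows):
    """Evaluate x^T Q x + c^T x using block_rows rows of Q at a time."""
    total = c @ x
    for start in range(0, Q.shape[0], block_rows):
        stop = start + block_rows
        # Rows start:stop of Q contribute x[start:stop] @ (Q[start:stop] @ x).
        total += x[start:stop] @ (Q[start:stop] @ x)
    return total

rng = np.random.default_rng(0)
x = rng.random(20)
Q = rng.random((20, 20))
c = np.arange(20)

full = x @ Q @ x + c @ x
blocked = quadratic_in_blocks(x, Q, c, block_rows=6)
print(np.isclose(full, blocked))
```

In a real memory-bound case each block of <code>Q</code> would be loaded or generated inside the loop and discarded afterwards.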