# Introduction to CuPy

Negin Sobhani, Deepak Cherian, and Max Jones  
negins@ucar.edu, dcherian@ucar.edu, max@carbonplan.org

------------


## Introduction to CuPy
CuPy is an open-source GPU-accelerated array library for Python that is compatible with NumPy. 

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/0*Qg5AIeVmg2nnP2XV.png" width="500">

CuPy uses NVIDIA CUDA to run operations on the GPU, which can provide significant performance improvements for numerical computations compared to running on the CPU. CuPy provides a NumPy-like interface for array manipulation and supports a wide range of mathematical operations, making it a powerful tool for scientific computing.

<div class="alert alert-block alert-success">
<b> In simple terms, CuPy can be described as the GPU equivalent of NumPy.</b>
</div>


### Import NumPy and CuPy

Once CuPy is installed, it can be imported just like NumPy.

In [1]:
## Import NumPy and CuPy
import cupy as cp
import numpy as np

### Creating Arrays in CuPy vs. NumPy


In [2]:
# create a 1D array with 5 elements on CPU
arr_cpu = np.array([1, 2, 3, 4, 5])
print("On the CPU: ", arr_cpu)
print(type(arr_cpu))

On the CPU:  [1 2 3 4 5]
<class 'numpy.ndarray'>


In [3]:
# create a 1D array with 5 elements on GPU
arr_gpu = cp.array([1, 2, 3, 4, 5])
print("On the GPU: ", arr_gpu)
print(type(arr_gpu))

On the GPU:  [1 2 3 4 5]
<class 'cupy.ndarray'>


You can also create multi-dimensional arrays.

In [4]:
# create a 2D array of zeros with 3 rows and 4 columns
arr_cpu = np.zeros((3, 4))
print("On the CPU: ", arr_cpu)
print(type(arr_cpu))

On the CPU:  [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
<class 'numpy.ndarray'>


In [5]:
# create the same 2D array of zeros (3 rows, 4 columns) on the GPU
arr_gpu = cp.zeros((3, 4))
print("On the GPU: ", arr_gpu)
print(type(arr_gpu))

On the GPU:  [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
<class 'cupy.ndarray'>


### Basic Operations 
CuPy provides a set of basic operations that mirror those of NumPy. It implements a subset of the NumPy API; see the CuPy reference documentation for the list of supported functions.

In [6]:
# NumPy: Create an array
numpy_a = np.array([1, 2, 3, 4, 5])

# CuPy: Create an array
cupy_a = cp.array([1, 2, 3, 4, 5])

In [7]:
# Basic arithmetic operations
numpy_b = numpy_a + 2
cupy_b = cupy_a + 2

numpy_c = numpy_a * 2
cupy_c = cupy_a * 2

numpy_d = numpy_a.dot(numpy_a)
cupy_d = cupy_a.dot(cupy_a)

# Reshaping arrays
numpy_e = numpy_a.reshape(5, 1)
cupy_e = cupy_a.reshape(5, 1)

# Transposing arrays
numpy_f = numpy_e.T
cupy_f = cupy_e.T

# More involved example: element-wise exponential normalized by its sum (softmax)
numpy_g = np.exp(numpy_a) / np.sum(np.exp(numpy_a))
cupy_g = cp.exp(cupy_a) / cp.sum(cp.exp(cupy_a))
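
Because the two libraries share most of their interface, the same function body can often serve both. The sketch below is one way to write such array-agnostic code, using `cupy.get_array_module()` to pick the module that matches the input. The helper names `get_xp` and `softmax` are illustrative, and the NumPy-only fallback is an assumption made so that the snippet also runs where CuPy is not installed.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # assumption: fall back to NumPy-only when CuPy is not installed
    cp = None


def get_xp(x):
    """Return the array module (numpy or cupy) that matches x."""
    if cp is not None:
        return cp.get_array_module(x)
    return np


def softmax(x):
    """Normalized exponential, computed with whichever library owns x."""
    xp = get_xp(x)
    e = xp.exp(x - xp.max(x))  # subtract the max for numerical stability
    return e / xp.sum(e)


result = softmax(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
print(type(result), float(result.sum()))
```

Passing a `cupy.ndarray` to the same `softmax` would run the computation on the GPU and return a `cupy.ndarray`.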

### Moving Data between Host and Device

`cupy.asarray()` moves a NumPy array from the host (CPU) to the GPU.

In [8]:
# Move data to GPU
arr_gpu = cp.asarray(arr_cpu)

To move a device array back to the host, use `cupy.asnumpy()`:

In [9]:
# Move data back to host
arr_cpu = cp.asnumpy(arr_gpu)

We can also use `cupy.ndarray.get()`:

In [10]:
arr_cpu = arr_gpu.get()
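
Host and device arrays generally cannot be mixed in a single expression, so code that accepts either kind often normalizes its inputs first. A minimal sketch of that pattern follows. `to_host` is a hypothetical helper name, not part of CuPy, and the import guard is an assumption that lets the snippet run without CuPy installed.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # assumption: allow running without CuPy
    cp = None


def to_host(x):
    """Return a NumPy array, copying from the GPU only when needed."""
    if cp is not None and isinstance(x, cp.ndarray):
        return cp.asnumpy(x)  # device -> host copy
    return np.asarray(x)      # already on the host


host = to_host(np.zeros((3, 4)))
print(type(host), host.shape)
```

Each transfer copies data between host and device memory, so it is usually worth keeping arrays on the GPU for as long as the computation allows.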

### Device Information 

The `nvidia-smi` command lists the GPUs available on your system, and CuPy can address each of them by its device ID. This matters most when your code is designed to use multiple GPUs. By default, operations are executed on device 0. The example below checks which device an array resides on:

In [11]:
cupy_g.device

<CUDA Device 0>
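
To place work on a particular GPU explicitly, CuPy offers the `cupy.cuda.Device` context manager. The sketch below first counts the visible devices, since a GPU may not be present. `gpu_count` is a hypothetical helper, and treating any failure as "no GPU" is an assumption made for illustration.

```python
def gpu_count():
    """Number of CUDA devices visible to CuPy (0 if CuPy or a GPU is missing)."""
    try:
        import cupy as cp
        return cp.cuda.runtime.getDeviceCount()
    except Exception:  # assumption: treat any import or driver failure as "no GPU"
        return 0


n_gpus = gpu_count()
print("Visible GPUs:", n_gpus)

if n_gpus > 0:
    import cupy as cp

    with cp.cuda.Device(0):  # arrays created in this block are allocated on device 0
        x = cp.arange(5)
    print(x.device)
```

On a multi-GPU node, replacing `0` with another valid device ID would allocate the array on that device instead.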

## CuPy Implemented Functions
CuPy has equivalents for many, but not all, of the commonly used NumPy functions. Almost all of CuPy's functions use the same function call as their NumPy equivalents.
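
One quick way to check whether a given NumPy function has a CuPy counterpart is to look the name up in the `cupy` namespace. This is only a sketch: `has_cupy_equivalent` is a hypothetical helper, and a matching name does not guarantee that every argument or dtype is supported.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # assumption: without CuPy, nothing has an equivalent
    cp = None


def has_cupy_equivalent(name):
    """True if `name` exists in NumPy and also in the cupy namespace."""
    return hasattr(np, name) and cp is not None and hasattr(cp, name)


for name in ["dot", "exp", "sum", "reshape"]:
    print(name, has_cupy_equivalent(name))
```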


## CuPy vs NumPy: Speed Comparison

In [13]:
import time

# create two 1000x1000 matrices
n = 1000

a_np = np.random.rand(n, n)
b_np = np.random.rand(n, n)

a_cp = cp.asarray(a_np)
b_cp = cp.asarray(b_np)

# perform matrix multiplication with NumPy and time it
start_time = time.time()
c_np = np.dot(a_np, b_np)
end_time = time.time()

numpy_time = end_time - start_time
print("NumPy time:", numpy_time, "seconds")

# perform matrix multiplication with CuPy and time it
start_time = time.time()
c_cp = cp.dot(a_cp, b_cp)
cp.cuda.Stream.null.synchronize()  # wait for GPU computation to finish
end_time = time.time()

cupy_time = end_time - start_time

print("CuPy time:", cupy_time, "seconds")
print("CuPy provides a", round(numpy_time / cupy_time, 2), "x speedup over NumPy.")

NumPy time: 0.19883275032043457 seconds
CuPy time: 4.461655616760254 seconds
CuPy provides a 0.04 x speedup over NumPy.


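In this first run, CuPy appears much slower than NumPy. The measurement most likely includes one-time start-up costs, such as creating the CUDA context and loading or compiling GPU kernels on the first call, rather than the matrix multiplication alone. The same n = 1000 case is far faster once it is repeated in the loop below.

Now, let's make the same comparison with other array sizes: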

In [14]:
for n in [10, 100, 1000, 5000, 10000]:
    print("n =", n)

    # create two nxn matrices
    a_np = np.random.rand(n, n)
    b_np = np.random.rand(n, n)
    a_cp = cp.asarray(a_np)
    b_cp = cp.asarray(b_np)

    # perform matrix multiplication with NumPy and time it
    start_time = time.time()
    c_np = np.dot(a_np, b_np)
    end_time = time.time()
    numpy_time = end_time - start_time

    # perform matrix multiplication with CuPy and time it
    start_time = time.time()
    c_cp = cp.dot(a_cp, b_cp)
    cp.cuda.Stream.null.synchronize()  # wait for GPU computation to finish
    end_time = time.time()
    cupy_time = end_time - start_time

    # print the speedup
    print("CuPy provides a", round(numpy_time / cupy_time, 2), "x speedup over NumPy.\n")

n = 10
CuPy provides a 0.4 x speedup over NumPy.

n = 100
CuPy provides a 0.58 x speedup over NumPy.

n = 1000
CuPy provides a 45.95 x speedup over NumPy.

n = 5000
CuPy provides a 64.83 x speedup over NumPy.

n = 10000
CuPy provides a 67.78 x speedup over NumPy.
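
Timings like these are sensitive to one-time start-up costs on the GPU, so a common precaution is to run the operation once before measuring it and to synchronize before stopping the clock. The helper below is a rough sketch of that idea. `time_call` is a hypothetical name, it uses `time.perf_counter()` rather than the `time.time()` calls above, and the GPU branch is skipped if CuPy or a device is unavailable.

```python
import time

import numpy as np


def time_call(fn, sync=None, warmup=1, repeat=3):
    """Best wall-clock time of fn() over `repeat` runs, after `warmup` untimed runs."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued GPU work before reading the clock
        best = min(best, time.perf_counter() - start)
    return best


n = 200
a_np = np.random.rand(n, n)
b_np = np.random.rand(n, n)
numpy_time = time_call(lambda: np.dot(a_np, b_np))
print("NumPy time:", numpy_time, "seconds")

try:
    import cupy as cp

    a_cp, b_cp = cp.asarray(a_np), cp.asarray(b_np)
    cupy_time = time_call(lambda: cp.dot(a_cp, b_cp),
                          sync=cp.cuda.Stream.null.synchronize)
    print("CuPy time:", cupy_time, "seconds")
except Exception:  # assumption: skip the GPU measurement if CuPy or a GPU is missing
    print("CuPy or a GPU is not available; skipping the GPU timing.")
```

Recent CuPy releases also provide `cupyx.profiler.benchmark`, which handles warm-up runs and synchronization itself.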

