## Compare matrix multiplication speeds for NumPy and CuPy

In [1]:
import numpy as np
import cupy as cp

n = 20000
x = np.random.rand(n,n)
y = np.random.rand(n,n)

First we'll time how long NumPy takes to matrix multiply two arrays of random values. The CPU is an Intel Xeon E5-2670 v3.

In [2]:
def np_mult(x,y):
    z = np.matmul(x,y)
    return z

In [3]:
%time z = np_mult(x,y)

CPU times: user 18min 54s, sys: 1min 42s, total: 20min 36s
Wall time: 25.9 s


And now for the exact same computation on a Quadro GV100 GPU using CuPy.

Note that CuPy provides GPU equivalents of NumPy functions with the same names, e.g. `cupy.matmul` and `cupy.random.rand`.
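Because the names match, one function can serve both libraries. The sketch below is only an illustration: `mult` and `get_xp` are hypothetical names, and the fallback to plain NumPy when CuPy or a CUDA device is unavailable is an assumption added so the cell runs anywhere. It relies on `cupy.get_array_module`, which returns `numpy` or `cupy` depending on the arrays it is given.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    get_xp = cp.get_array_module      # picks numpy or cupy from the inputs
except Exception:
    def get_xp(*arrays):
        """Fallback (assumption): without CuPy, everything is NumPy."""
        return np

def mult(x, y):
    """Matrix multiply with whichever array library x and y belong to."""
    xp = get_xp(x, y)
    return xp.matmul(x, y)

a = np.ones((2, 3))
b = np.ones((3, 4))
c = mult(a, b)  # NumPy inputs, so this runs on the CPU
print(type(c).__module__, c.shape)
```

Passing CuPy arrays to the same `mult` would dispatch to `cupy.matmul` instead.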

In [4]:
def cp_mult(x,y):
    z = cp.matmul(x,y)
    return z

In [5]:
x_gpu = cp.random.rand(n,n)
y_gpu = cp.random.rand(n,n)
    
%time z_gpu = cp_mult(x_gpu,y_gpu)

CPU times: user 255 ms, sys: 144 ms, total: 399 ms
Wall time: 414 ms


Wow, that is fast!
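One way to sanity-check these timings is to convert them to throughput. A dense product of two n×n matrices needs about 2n³ floating-point operations, so the sketch below divides that count by the wall times reported in cells 3 and 5. The helper `matmul_gflops` is a made-up name, and the double-precision peak of roughly 7.4 TFLOP/s for the Quadro GV100 is an approximate vendor figure assumed here, not something measured in this notebook.

```python
def matmul_gflops(n, seconds):
    """Approximate throughput in GFLOP/s for an (n x n) @ (n x n) product."""
    flops = 2 * n**3  # about n**3 multiplications plus n**3 additions
    return flops / seconds / 1e9

n = 20000
cpu_gflops = matmul_gflops(n, 25.9)   # NumPy wall time from cell 3
gpu_gflops = matmul_gflops(n, 0.414)  # CuPy wall time from cell 5

print(f"NumPy: {cpu_gflops:,.0f} GFLOP/s")
print(f"CuPy:  {gpu_gflops:,.0f} GFLOP/s")

# Assumed approximate FP64 peak of the Quadro GV100, in GFLOP/s.
GV100_FP64_PEAK_GFLOPS = 7400
print("GPU figure exceeds assumed peak:", gpu_gflops > GV100_FP64_PEAK_GFLOPS)
```

The CPU figure comes out at roughly 620 GFLOP/s, while the GPU figure is several times the assumed peak. That suggests `%time` did not capture the whole GPU computation, which is plausible because CuPy launches kernels asynchronously.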

But is this a realistic comparison? Note that in this example, `x_gpu`, `y_gpu` and `z_gpu` all live in GPU memory.

Whether this example is realistic depends on whether we can run all of our computation on the GPU, or whether part of it runs on the CPU and part on the GPU. In the latter case we need host (CPU) to device (GPU) data transfers. It's likely we'll have data in CPU memory that needs to be transferred to the GPU (by converting a NumPy array to a CuPy array) for computation, and then transferred back to CPU memory afterwards.

First let's see how long it takes to move `z_gpu` back into CPU memory.

In [6]:
%time z = cp.asnumpy(z_gpu)

CPU times: user 2.38 s, sys: 1.5 s, total: 3.88 s
Wall time: 3.88 s


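This step takes far longer than the time reported for the multiplication. Note that CuPy launches GPU kernels asynchronously and `cp.asnumpy` has to wait for pending work, so some of these 3.88 s may be the multiplication finishing rather than pure data transfer.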
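The compute and copy costs can be separated more cleanly by synchronizing with the GPU before stopping the clock, since CuPy queues kernels asynchronously. The following is a minimal sketch, not the method used for the timings above: `timed`, `sync` and `n_small` are hypothetical names, the matrix size is reduced to keep the demo quick, and the fallback to NumPy when no CUDA device is found is an assumption added so the cell runs anywhere.

```python
import time
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False

def sync():
    """Block until queued work on the default GPU stream has finished."""
    if HAVE_GPU:
        cp.cuda.Stream.null.synchronize()

def timed(fn, *args):
    """Return (result, seconds), synchronizing before and after the call."""
    sync()
    start = time.perf_counter()
    result = fn(*args)
    sync()
    return result, time.perf_counter() - start

xp = cp if HAVE_GPU else np  # fall back to NumPy so the sketch still runs
n_small = 1000               # hypothetical smaller size for a quick demo
a = xp.random.rand(n_small, n_small)
b = xp.random.rand(n_small, n_small)

c, t_matmul = timed(xp.matmul, a, b)
to_host = cp.asnumpy if HAVE_GPU else np.asarray
c_host, t_copy = timed(to_host, c)

print(f"matmul: {t_matmul:.4f} s, copy to host: {t_copy:.4f} s")
```

The first GPU call usually includes one-off start-up overhead, so repeating the measurement a few times and discarding the first run gives a fairer number.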

What is the time for a new version of our CuPy function, which takes NumPy arrays as input and returns a NumPy array? 

In [7]:
def cp_mult_v2(x,y):
    x_gpu = cp.array(x)
    y_gpu = cp.array(y)
    z_gpu = cp.matmul(x_gpu,y_gpu)
    z = cp.asnumpy(z_gpu)
    return z

In [8]:
%time z = cp_mult_v2(x,y)

CPU times: user 3.7 s, sys: 5.21 s, total: 8.9 s
Wall time: 8.9 s


That is much slower than the GPU-only version, but at 8.9 s it is still roughly three times faster than NumPy's 25.9 s.
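To put the transfer cost in perspective, the short calculation below works out how much data `cp_mult_v2` moves and the speedups implied by the wall times reported above. It is plain arithmetic on numbers from this notebook; the variable names are illustrative only, and it assumes the default `float64` dtype returned by `np.random.rand`.

```python
n = 20000
bytes_per_array = n * n * 8           # float64 uses 8 bytes per element
gb_per_array = bytes_per_array / 1e9  # decimal gigabytes

# cp_mult_v2 copies x and y to the GPU and z back to the host.
gb_moved = 3 * gb_per_array

# Wall times reported in cells 3, 5 and 8.
t_numpy, t_gpu_resident, t_gpu_with_copies = 25.9, 0.414, 8.9

print(f"Each array: {gb_per_array:.1f} GB, moved by cp_mult_v2: {gb_moved:.1f} GB")
print(f"Speedup, data already on GPU: {t_numpy / t_gpu_resident:.1f}x")
print(f"Speedup, including transfers: {t_numpy / t_gpu_with_copies:.1f}x")
```

Each matrix is 3.2 GB, so the round trip moves 9.6 GB, and the advantage over NumPy drops from about 63x to about 2.9x once the copies are included.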

Finally, let's check which BLAS library NumPy is linked against.

In [9]:
np.show_config() 

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
