## Requirements

In [1]:
import cupy as cp
import numpy as np
import scipy.linalg  # explicit submodule import; plain `import scipy` does not load it in older SciPy versions

## Create data

In [2]:
nr_rows, nr_cols = 2_000, 2_000

In [3]:
A = np.random.uniform(size=(nr_rows, nr_cols))
B = np.random.uniform(size=A.shape)

## Matrix-matrix multiplication

In [4]:
%%time
C = A@B

CPU times: user 1.93 s, sys: 211 ms, total: 2.14 s
Wall time: 136 ms


In [5]:
%%time
A_dev = cp.array(A, copy=True)
B_dev = cp.array(B, copy=True)

CPU times: user 1.55 s, sys: 293 ms, total: 1.84 s
Wall time: 1.79 s


Although copying is requested, that doesn't seem to happen.
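One way to check whether the transfer really yields an independent copy is to modify the host array afterwards and confirm that the device array is unchanged. The sketch below uses a small matrix and `cp.asnumpy` to bring the data back to the host. The NumPy fallback is not part of the original notebook; it is an assumption added only so the snippet also runs on a machine without a usable GPU.

```python
import numpy as np

# Fall back to NumPy when no usable GPU is present (assumption for portability).
try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device")
    xp, to_host = cp, cp.asnumpy
except Exception:
    xp, to_host = np, np.asarray

A = np.random.uniform(size=(4, 4))
A_dev = xp.array(A, copy=True)   # host-to-device transfer when xp is CuPy

original = A[0, 0]
A[0, 0] = -1.0                   # change the host array only

# If a real copy was made, the device array still holds the original value.
copied = bool(to_host(A_dev)[0, 0] == original)
print("independent copy:", copied)
```

If this prints `True`, the array on the device does not share memory with `A`.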

In [6]:
%%time
C_dev = A_dev@B_dev

CPU times: user 113 ms, sys: 79.7 ms, total: 193 ms
Wall time: 195 ms


In [7]:
%%time
C_dev = A_dev@B_dev

CPU times: user 3.14 ms, sys: 400 μs, total: 3.54 ms
Wall time: 2.58 ms
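The large gap between the first and second run is likely due to one-time initialization on the first call. In addition, CuPy launches GPU work asynchronously, so `%%time` may return before the computation has actually finished. A minimal sketch of a timing helper that does a warm-up call and synchronizes before stopping the clock is given below. The helper name `timed` and the NumPy fallback are hypothetical additions, not part of the original notebook.

```python
import time
import numpy as np

# Fall back to NumPy when no usable GPU is present (assumption for portability).
try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device")
    xp, sync = cp, cp.cuda.Stream.null.synchronize
except Exception:
    xp, sync = np, (lambda: None)


def timed(fn, *args, repeats=3):
    """Return (result, best wall time in seconds), excluding one warm-up call."""
    result = fn(*args)   # warm-up: absorbs one-time initialization costs
    sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        sync()           # wait until the device has actually finished
        best = min(best, time.perf_counter() - start)
    return result, best


X = xp.random.uniform(0.0, 1.0, size=(500, 500))
Y = xp.random.uniform(0.0, 1.0, size=(500, 500))
Z, seconds = timed(xp.matmul, X, Y)
print(f"matmul: {seconds * 1e3:.2f} ms")
```

Recent CuPy releases also provide `cupyx.profiler.benchmark`, which handles warm-up and synchronization itself and may be preferable to a hand-written helper.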


If possible, it helps to create the data directly on the GPU.

In [8]:
%%time
X_dev = cp.random.uniform(0.0, 1.0, size=(nr_rows, nr_cols))
Y_dev = cp.random.uniform(0.0, 1.0, size=(nr_rows, nr_cols))

CPU times: user 197 ms, sys: 29.6 ms, total: 227 ms
Wall time: 227 ms


In [9]:
%%time
Z_dev = X_dev@Y_dev

CPU times: user 2.03 ms, sys: 259 μs, total: 2.29 ms
Wall time: 1.72 ms


## Matrix power

Since there seems to be an issue with `%%timeit` in combination with `cp.linalg.matrix_power`, `%%time` is used in this section.

In [10]:
D = np.random.uniform(size=(1_000, 1_000))

In [11]:
%%time
p = np.linalg.matrix_power(D, 10)

CPU times: user 1.5 s, sys: 424 ms, total: 1.93 s
Wall time: 130 ms


In [12]:
D_dev = cp.random.uniform(0.0, 1.0, size=(1_000, 1_000))

In [13]:
%%time
p_dev = cp.linalg.matrix_power(D_dev, 10)

CPU times: user 143 ms, sys: 10.6 ms, total: 153 ms
Wall time: 7.04 ms


## Singular Value Decomposition (SVD)

In [14]:
%%timeit
t = scipy.linalg.svd(D)

497 ms ± 89.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%%timeit
t_dev = cp.linalg.svd(D_dev)

554 ms ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## QR decomposition

In [16]:
%%timeit
t = scipy.linalg.qr(D)

132 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [17]:
%%timeit
t_dev = cp.linalg.qr(D_dev)

The slowest run took 16.34 times longer than the fastest. This could mean that an intermediate result is being cached.
14.5 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
