# Matrix multiply speedup: X 63,000?
Let's code it in pure Python, then compare with NumPy (C) on the CPU and with PyTorch and TensorFlow on the GPU.

Let's write down two matrices and multiply them using Python.

This will also serve as a five-minute introduction to Python!

The matrices we will use are the 2x2 matrices below.
$$ 
A =\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
\text{ }
B =\begin{bmatrix}
5 & 6 \\
7 & 8 \\
\end{bmatrix}
$$
A list is a Python data type that is commonly used when working with arrays. In the frameworks we will use, one common way to define a matrix is as a list of lists.

In [5]:
A = [[1,2],[3,4]] 

In [26]:
A

[[1, 2], [3, 4]]

In [27]:
type(A)

list

In [13]:
for row in A:
  print(row)

[1, 2]
[3, 4]


In [17]:
B = [[5,6],[7,8]]

In [28]:
B[0]

[5, 6]

In [29]:
type(B[0])

list

In [23]:
B[1]

[7, 8]

In [24]:
row1_B = B[0]
row2_B = B[1]
print(row1_B)
print(row2_B)

[5, 6]
[7, 8]


In [32]:
element_1_2 = B[0][1]
element_1_2

6

In [33]:
for i in range(2):
    print(i)

0
1


In [44]:
def matrix_multiply(A,B):
    """Multiply square matrices and return result"""
    AB = [[0,0],[0,0]] # Create matrix of zeros
    for i in range(2):
        for j in range(2):
            AB[i][j] = sum(A[i][k]*B[k][j] for k in range(2))
    return AB

In [48]:
help(matrix_multiply)

Help on function matrix_multiply in module __main__:

matrix_multiply(A, B)
    Multiply square matrices and return result



In [49]:
matrix_multiply(A,B)

[[19, 22], [43, 50]]

The result is 
$$ 
\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
\cdot
\begin{bmatrix}
5 & 6 \\
7 & 8 \\
\end{bmatrix}
= \begin{bmatrix}
19 & 22 \\
43 & 50 \\
\end{bmatrix}
$$
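Each entry of the product is the dot product of a row of $A$ with a column of $B$, which is what the `sum(...)` expression in `matrix_multiply` computes. Written out for this example:

$$
\begin{aligned}
(AB)_{11} &= 1\cdot 5 + 2\cdot 7 = 19, & (AB)_{12} &= 1\cdot 6 + 2\cdot 8 = 22,\\
(AB)_{21} &= 3\cdot 5 + 4\cdot 7 = 43, & (AB)_{22} &= 3\cdot 6 + 4\cdot 8 = 50.
\end{aligned}
$$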

In [102]:
def matrix_multiply(A,B):
    """Multiply square matrices and return result"""
    m = len(A) 
    AB = [[0]*m for i in range(m)] # create matrix of zeros
    
    for i in range(m):
        for j in range(m):
            AB[i][j] = sum(A[i][k]*B[k][j] for k in range(m))
    
    return AB
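The zero matrix in the function above is built with a list comprehension rather than the shorter `[[0]*m]*m`. A minimal sketch of why that matters (the names `aliased` and `independent` are only for illustration):

```python
m = 3

# Pitfall: multiplying the outer list repeats a reference to ONE inner list.
aliased = [[0] * m] * m
aliased[0][0] = 1          # shows up in every row, since all rows are the same object

# The comprehension builds a fresh inner list for each row.
independent = [[0] * m for _ in range(m)]
independent[0][0] = 1      # changes only the first row

print(aliased)       # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(independent)   # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

With the aliased version, every assignment `AB[i][j] = ...` would overwrite the same row, so the product would come out wrong.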
        

In [96]:
import random


In [90]:
random.randrange(m) # random integer from 0 to m-1

778

In [112]:
m = 850 # matrix size 850
A = [[random.randrange(m) for j in range(m)] for i in range(m)]
B = [[random.randrange(m) for j in range(m)] for i in range(m)]

In [120]:
%%timeit
matrix_multiply(A,B)

1min 3s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Implement in NumPy
NumPy is a Python package that provides numerical arrays similar to those found in R, C and Fortran. They resemble the lists above, but all elements must share one numeric type, such as float or int. NumPy hands matrix calculations to numerical libraries written in C (or in languages that can be linked with C, such as Fortran).
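A small sketch of the "same numeric type" rule, using the 2x2 matrices from the start of this notebook. The names `A_small`, `B_small` and `mixed` are made up here so they do not clash with the arrays defined below:

```python
import numpy as np

A_small = np.array([[1, 2], [3, 4]])   # all ints, so an integer dtype is chosen
B_small = np.array([[5, 6], [7, 8]])
mixed = np.array([[1, 2.5], [3, 4]])   # one float is enough to make every element a float

print(A_small.dtype.kind)   # i (signed integer; the exact width depends on the platform)
print(mixed.dtype)          # float64
print(A_small @ B_small)    # for 2-D arrays the @ operator does the same as np.matmul
```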

In [117]:
import numpy as np
A_np = np.array(A)
B_np = np.array(B)

In [118]:
A_np

array([[609, 138, 672, ..., 376, 522, 701],
       [504, 169, 372, ..., 809,  87, 324],
       [344, 191, 243, ..., 806, 660, 206],
       ...,
       [330, 542,  73, ..., 333, 498, 497],
       [651, 455, 729, ..., 670, 222, 631],
       [518, 483, 698, ..., 849, 641, 644]])

In [124]:
%%timeit
np.matmul(A_np,B_np)

605 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


So how much faster is that?
Let's say it was 600 ms. The Python calculation was about 60 seconds, or 60,000 ms. Hence NumPy was about 100X faster!
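The same ratio can be computed from the unrounded `%%timeit` figures reported above. The variable names are just for this sketch:

```python
# Timings copied from the %%timeit outputs above
python_seconds = 63.0      # "1min 3s" for the pure-Python loops
numpy_seconds = 0.605      # "605 ms" for np.matmul

speedup = python_seconds / numpy_seconds
print(f"NumPy speedup over pure Python: about {speedup:.0f}X")   # about 104X
```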

# PyTorch
First let's try it on the CPU. It should be about the same speed as NumPy, since it should call the same C libraries.
Then we will move the data to the GPU and perform the matrix multiply there.


In [122]:
import torch

In [123]:
A_t = torch.tensor(A)
B_t = torch.tensor(B)

In [134]:
%%timeit
torch.matmul(A_t,B_t)

72.7 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Surprisingly, this was roughly an 8X improvement over NumPy (605 ms versus 72.7 ms). As far as I know there is no reason to expect such an improvement in general. Performance for matrix multiply depends strongly on the [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) library that the NumPy and PyTorch implementations rely on.
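One possible contributor, offered as a hypothesis rather than a diagnosis of the timings above: the matrices here hold integers, and BLAS routines are defined only for floating-point types, so NumPy computes integer matrix products with its own non-BLAS loop. The sketch below compares the two dtypes on the same values. The helper `time_matmul` and the size 300 are made up for this illustration, and the timings will differ from machine to machine.

```python
import time
import numpy as np

def time_matmul(X, Y, repeats=3):
    """Return the best wall-clock time in seconds over a few runs of np.matmul."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(X, Y)
        best = min(best, time.perf_counter() - start)
    return best

size = 300  # hypothetical size, kept small so the sketch runs quickly
rng = np.random.default_rng(0)
X_int = rng.integers(0, size, size=(size, size))   # integer dtype, as in A_np above
X_float = X_int.astype(np.float64)                 # same values stored as floats

print(f"integer matmul: {time_matmul(X_int, X_int) * 1e3:.1f} ms")
print(f"float64 matmul: {time_matmul(X_float, X_float) * 1e3:.1f} ms")

np.show_config()  # lists the BLAS/LAPACK libraries this NumPy build was linked against
```

If the float64 timing is much lower than the integer one, the dtype rather than the framework may explain part of the gap.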

# Let's try on the GPU
First we check that GPUs are available to us.
One way to do this is from the command line using the command `nvidia-smi`.

In [156]:
!nvidia-smi

Tue Nov  2 13:21:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    Off  | 00000000:1B:00.0 Off |                  Off |
| 30%   30C    P8    28W / 300W |   6503MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We can also use utilities in PyTorch.

In [157]:
torch.cuda.is_available()

True

How many GPUs? (I requested 1.)

In [130]:
torch.cuda.device_count()

1

What GPU?

In [139]:
torch.cuda.get_device_name()

'NVIDIA RTX A6000'

How much GPU memory is currently used?

In [160]:
print('Memory allocated:', round(torch.cuda.memory_allocated()/1024**3,4), 'GB')
print('Memory reserved:   ', round(torch.cuda.memory_reserved()/1024**3,4), 'GB')

Memory allocated: 0.0054 GB
Memory reserved:    0.0215 GB


First we move the matrices from the CPU memory to the GPU memory.

In [165]:
A_t.to('cuda')
B_t.to('cuda')

tensor([[494, 496, 435,  ...,  90, 647, 319],
        [499, 102, 194,  ..., 470, 708, 698],
        [409, 204, 714,  ..., 427, 646, 234],
        ...,
        [686, 597, 451,  ..., 411,  93, 839],
        [386,  74,  45,  ..., 418, 513, 445],
        [292, 766, 539,  ..., 647, 390, 508]], device='cuda:0')

In [159]:
print('Memory reserved:   ', round(torch.cuda.memory_reserved()/1024**3,4), 'GB')

Memory reserved:    0.0215 GB


In [166]:
%%timeit
# ensure that context initialization finishes before you start measuring time
torch.matmul(A_t,B_t)


76.3 ms ± 768 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


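No speed up? `Tensor.to` returns a new tensor rather than moving the original, and the results of the two calls above were not assigned to anything, so `A_t` and `B_t` are still in CPU memory and this timing is another CPU measurement. To see what the GPU can do we also need bigger matrices. We create two NumPy arrays below of size `n` by `n`.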

In [173]:
n = 20000
C = np.random.rand(n,n)
D = np.random.rand(n,n)
C_t = torch.tensor(C).to("cuda")
D_t = torch.tensor(D).to("cuda")
torch.cuda.synchronize()

How much bigger are these matrices?

In [176]:
print('Memory reserved:   ', round(torch.cuda.memory_reserved()/1024**3,4), 'GB')

Memory reserved:    8.2207 GB
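A back-of-the-envelope check on that figure, assuming 8 bytes per element (`np.random.rand` returns float64, and `torch.tensor` keeps that dtype). "GB" here follows the cells above and means 1024**3 bytes:

```python
n = 20000
m = 850
bytes_per_element = 8                      # assumes float64, the dtype np.random.rand returns

bytes_per_matrix = n * n * bytes_per_element
gib_per_matrix = bytes_per_matrix / 1024**3
ratio = (n * n) / (m * m)                  # how many times more elements than an 850 x 850 matrix

print(f"one matrix:   {gib_per_matrix:.2f} GB")      # 2.98 GB
print(f"two matrices: {2 * gib_per_matrix:.2f} GB")  # 5.96 GB
print(f"elements vs. 850 x 850: about {ratio:.0f}X") # about 554X
```

The reserved figure reported above is larger than the two matrices alone. That is plausible because reserved memory is what PyTorch's caching allocator holds, which can exceed what live tensors occupy.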


What is the speed without GPU?

In [174]:
%%timeit
np.matmul(C,D)

41.5 s ± 81.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


And on GPU?

In [177]:
%%timeit
torch.matmul(C_t,D_t)
torch.cuda.synchronize()

28.5 s ± 87.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
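The pattern used in the last cells (keep the tensor that `.to` returns, and call `torch.cuda.synchronize()` inside the timed region) can be wrapped in a small helper. This is a sketch under assumptions: `timed_matmul` and `n_small` are hypothetical names, float32 random data is used, and the code falls back to the CPU when no GPU is present.

```python
import time
import torch

# Fall back to the CPU so the sketch still runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def timed_matmul(X, Y):
    """Return (product, seconds), waiting for pending GPU work before reading the clock."""
    if X.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    product = torch.matmul(X, Y)
    if X.is_cuda:
        torch.cuda.synchronize()   # kernels launch asynchronously; wait for completion
    return product, time.perf_counter() - start

n_small = 512                          # hypothetical size, far smaller than n above
X_cpu = torch.rand(n_small, n_small)   # float32 by default
X_dev = X_cpu.to(device)               # .to returns a tensor; X_cpu itself stays on the CPU

timed_matmul(X_dev, X_dev)             # warm-up run, excluded from the measurement
product, seconds = timed_matmul(X_dev, X_dev)
print(product.device, f"{seconds * 1e3:.2f} ms")
```

Without the synchronize calls, the clock could be read while the GPU is still working, which would make the GPU look faster than it is.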
