# Matrix multiply speedup: 63,000 X?
Let's code it in pure Python, then compare with NumPy (C) on the CPU, and with PyTorch and TensorFlow on the GPU.

Let's write down two matrices and multiply them using Python.

This will also serve as a five minute introduction to Python!

The matrices we will use are the 2x2 matrices below.
$$ 
A =\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
\text{ }
B =\begin{bmatrix}
5 & 6 \\
7 & 8 \\
\end{bmatrix}
$$
A list is a Python data type that is commonly used when working with arrays. In the frameworks we will use, one common way to define a matrix is as a list of lists.

In [2]:
A = [[1,2],[3,4]] 

In [3]:
A

[[1, 2], [3, 4]]

In [4]:
type(A)

list

In [5]:
for row in A:
  print(row)

[1, 2]
[3, 4]


In [6]:
B = [[5,6],[7,8]]

In [7]:
B[0]

[5, 6]

In [8]:
type(B[0])

list

In [9]:
B[1]

[7, 8]

In [10]:
row1_B = B[0]
row2_B = B[1]
print(row1_B)
print(row2_B)

[5, 6]
[7, 8]


In [11]:
element_1_2 = B[0][1]
element_1_2

6

In [12]:
for i in range(2):
    print(i)

0
1


In [13]:
def matrix_multiply(A,B):
    AB = [[0,0],[0,0]] # Create matrix of zeros
    for i in range(2):
        for j in range(2):
            AB[i][j] = sum(A[i][k]*B[k][j] for k in range(2))
    return AB

In [14]:
help(matrix_multiply)

Help on function matrix_multiply in module __main__:

matrix_multiply(A, B)



In [15]:
matrix_multiply(A,B)

[[19, 22], [43, 50]]

The result is 
$$ 
\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
\cdot
\begin{bmatrix}
5 & 6 \\
7 & 8 \\
\end{bmatrix}
= \begin{bmatrix}
19 & 22 \\
43 & 50 \\
\end{bmatrix}
$$

Below we write a Python function to multiply square matrices of any size. We are not interested in optimizing its runtime; rather, we focus on writing a function that reads like the mathematical operation, specifically the fact that the $(i,j)$ entry of the result $AB$ is given by the formula below.
$$
(AB)_{i,j} = \sum_{k=1}^n A_{i,k}B_{k,j}
$$
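
Note that the formula uses 1-based indices while Python lists are 0-based, so $k = 1, \dots, n$ becomes `range(n)`. As a small sketch (the helper name `entry` is made up for illustration), a single entry of the 2x2 product above can be computed like this:

```python
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

def entry(A, B, i, j):
    """Return entry (i, j) of AB, using 0-based indices."""
    n = len(B)
    return sum(A[i][k] * B[k][j] for k in range(n))

top_left = entry(A, B, 0, 0)      # 1*5 + 2*7
bottom_right = entry(A, B, 1, 1)  # 3*6 + 4*8
print(top_left, bottom_right)
```

The full matrix multiply is this same sum repeated for every pair `(i, j)`.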

In [16]:
def matrix_multiply(A,B):
    """Multiply square matrices and return result"""
    m = len(A) 
    AB = [[0]*m for i in range(m)] # create AB as matrix of zeros then fill in results 
    
    for i in range(m):
        for j in range(m):
            AB[i][j] = sum(A[i][k]*B[k][j] for k in range(m))
    
    return AB
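
One detail in the function above is worth a closer look: the zero matrix is built with a list comprehension and not with `[[0]*m]*m`. The sketch below (variable names are illustrative only) shows why, since the shorter form repeats a reference to one and the same row:

```python
m = 3

aliased = [[0] * m] * m                     # m references to the SAME row list
independent = [[0] * m for _ in range(m)]   # m separate row lists

aliased[0][0] = 1       # changes every "row", because they are one object
independent[0][0] = 1   # changes only the first row

print(aliased)
print(independent)
```

With the aliased version, filling in `AB[i][j]` would overwrite the same row again and again, so the comprehension is the safer pattern.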
        

Let's create a matrix initialized with random values.

In [17]:
import random

In [18]:
m = 256 # matrix size
A = [[random.randrange(m) for j in range(m)] for i in range(m)]
B = [[random.randrange(m) for j in range(m)] for i in range(m)]

In [19]:
%timeit -r 1 matrix_multiply(A,B)

1.85 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


On this CPU, this took about two seconds.

# Implement in NumPy
NumPy is a Python package that provides numerical arrays similar to those found in R, C and Fortran. They are like the lists above, but all elements must share the same numeric type, such as float or int. NumPy hands matrix calculations to numerical libraries written in C (or in languages linkable to C, such as Fortran).

In [20]:
import numpy as np
A_np = np.array(A)
B_np = np.array(B)

In [21]:
A_np

array([[158,  29, 146, ..., 170,  66, 245],
       [ 72,  13,  40, ..., 243, 214,  77],
       [171, 150, 212, ..., 177,  19,   2],
       ...,
       [133, 250, 153, ..., 138, 125, 102],
       [ 31, 195, 121, ...,  78, 165, 157],
       [184, 138,  29, ...,   3, 221, 199]])

In [22]:
%%timeit
np.matmul(A_np,B_np)

20.9 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
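
As a quick sanity check on a small case, the sketch below confirms that `np.matmul` reproduces the 2x2 result computed by hand earlier, and shows the "single numeric type" rule in action. The array named `mixed` is an illustration only and is not part of the benchmark.

```python
import numpy as np

A_small = np.array([[1, 2], [3, 4]])
B_small = np.array([[5, 6], [7, 8]])

# Same product as the pure Python matrix_multiply on the 2x2 example
product = np.matmul(A_small, B_small)
print(product)

# All elements share one dtype: mixing an int and a float upcasts to float64
mixed = np.array([1, 2.5])
print(A_small.dtype, mixed.dtype)
```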


So how much faster is that?
Call it 20 ms. The pure Python calculation took about 2 s, that is, 2000 ms. Hence NumPy was ~100X faster!

# PyTorch

In [23]:
import torch

First let's try it on the CPU in PyTorch. It should be about the same speed as NumPy, since it should call into the same C libraries.
Then we will move the data to the GPU and perform the matrix multiply there.

In [24]:
A_t = torch.tensor(A)
B_t = torch.tensor(B)

In [25]:
A_t.dtype

torch.int64

In [26]:
%%timeit
torch.matmul(A_t,B_t)

2.56 ms ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Surprisingly, this was on the order of a 5-10X improvement over NumPy (20.9 ms versus 2.56 ms), even though both hold 64-bit integers. The data type also has a large impact on the matrix computation: if we switch from a 64-bit integer to a 64-bit float, we again see an improvement of about 10X.

In [27]:
A_td = torch.DoubleTensor(A)
B_td = torch.DoubleTensor(B)

In [28]:
%%timeit
torch.matmul(A_td,B_td)

285 µs ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
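
To look at the data type effect in isolation, one can repeat the integer versus float comparison inside a single library. The sketch below does this with NumPy and the standard `timeit` module; it is not part of the original benchmark, the variable names are illustrative, and the timings it prints depend on the machine and on how NumPy was built. A plausible explanation, stated here as an assumption, is that floating point matrix products are dispatched to optimized BLAS routines while integer products are not.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
m = 256

# Same values stored as 64-bit integers and as 64-bit floats
A_int = rng.integers(0, m, size=(m, m), dtype=np.int64)
B_int = rng.integers(0, m, size=(m, m), dtype=np.int64)
A_flt = A_int.astype(np.float64)
B_flt = B_int.astype(np.float64)

# Best of 3 repeats, 5 calls each, reported as seconds per call
t_int = min(timeit.repeat(lambda: np.matmul(A_int, B_int), number=5, repeat=3)) / 5
t_flt = min(timeit.repeat(lambda: np.matmul(A_flt, B_flt), number=5, repeat=3)) / 5

# The entries are small enough that float64 holds every sum exactly
same = np.array_equal(np.matmul(A_int, B_int),
                      np.matmul(A_flt, B_flt).astype(np.int64))

print(f"int64:   {t_int * 1e3:.3f} ms")
print(f"float64: {t_flt * 1e3:.3f} ms")
print("results agree:", same)
```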


In total, we are at roughly a 6,500X improvement (1.85 s versus 285 µs), that is, on the order of 10000X over the original pure Python, naive matrix multiply implementation.
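
The cumulative speedups can be tallied directly from the timings reported above. A minimal sketch, in which the dictionary simply hard-codes the `%timeit` results from this notebook run (they will differ on other hardware):

```python
# Seconds per 256x256 matrix multiply, copied from the outputs above
timings_s = {
    "pure Python": 1.85,
    "NumPy int64": 20.9e-3,
    "PyTorch CPU int64": 2.56e-3,
    "PyTorch CPU float64": 285e-6,
}

baseline = timings_s["pure Python"]
speedups = {name: baseline / t for name, t in timings_s.items()}

for name, factor in speedups.items():
    print(f"{name:22s} {factor:8.0f}X")
```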

# Let's try that last computation on the GPU
First we check that GPUs are available to us.
One way to do this is from the command line using the command `nvidia-smi`.

In [29]:
!nvidia-smi

Wed Jan 25 14:46:09 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

We can also use utilities in PyTorch.

In [30]:
torch.cuda.is_available()

True

How many GPUs are there? (I requested 1.)

In [31]:
torch.cuda.device_count()

1

What GPU?

In [32]:
torch.cuda.get_device_name()

'Tesla T4'

How much GPU memory is currently in use? It should be 0, because the previous PyTorch computation used the CPU only and never moved the matrices onto the GPU.

In [33]:
print('Memory allocated:', round(torch.cuda.memory_allocated()/1024**3,4), 'GB')
print('Memory reserved:   ', round(torch.cuda.memory_reserved()/1024**3,4), 'GB')

Memory allocated: 0.0 GB
Memory reserved:    0.0 GB


First we move the matrices from the CPU memory to the GPU memory.

In [34]:
A_td = A_td.to('cuda')
B_td = B_td.to('cuda')

In [35]:
print('Memory allocated:', round(torch.cuda.memory_allocated()/1024**3,4), 'GB')
print('Memory reserved:   ', round(torch.cuda.memory_reserved()/1024**3,4), 'GB')

Memory allocated: 0.001 GB
Memory reserved:    0.002 GB


In [36]:
%%timeit
# synchronize so the timing covers the GPU computation itself, not just the kernel launch
torch.matmul(A_td,B_td)
torch.cuda.synchronize()

518 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


For this small 256x256 case the GPU run (518 µs) was actually slower than the CPU run (285 µs): when the matrix is small, the many cores of the GPU are underutilized. For larger matrices one will see a ~2X improvement with a T4 GPU (see below).

We add `torch.cuda.synchronize()` to tell Python to wait for the GPU computation to finish before completing the timing. Otherwise, as soon as the computation is handed off to the GPU, `timeit` reports it finished, giving incorrect timings. 
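
Outside a notebook, the same idea can be wrapped in a small helper. The sketch below is one possible way to do it, not the method used for the timings in this notebook: the function name `time_matmul` and the `repeats` parameter are made up for illustration, and the code falls back to the CPU when no GPU is present.

```python
import time
import torch

def time_matmul(a, b, repeats=10):
    """Average seconds per torch.matmul call, synchronizing when on CUDA."""
    on_gpu = a.is_cuda
    torch.matmul(a, b)            # warm-up call, excluded from the timing
    if on_gpu:
        torch.cuda.synchronize()  # make sure the warm-up has finished
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if on_gpu:
        torch.cuda.synchronize()  # wait for all queued GPU work to complete
    return (time.perf_counter() - start) / repeats

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(256, 256, dtype=torch.float64, device=device)
y = torch.rand(256, 256, dtype=torch.float64, device=device)

seconds = time_matmul(x, y)
print(f"{device}: {seconds * 1e6:.1f} µs per matmul")
```

Dropping the second `synchronize()` on a GPU would time only how long it takes to queue the work, which is the pitfall described above.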


In total then, combining this with the ~2X GPU gain seen on larger matrices (see below), we are on the order of a 20000X improvement.

# Playing with Precision
We saw earlier the impact the data type had. PyTorch also provides single- and half-precision data types. We will increase the matrix size so that the timings stay in the ms range and higher. Below we create two NumPy arrays of size `n` by `n`.

In [37]:
n = 10000

The cell below takes care of conversion as the parameter `n` gets passed in as a string when running the notebook as a parameterized compute job.

In [38]:
n = int(n)

In [39]:
C = np.random.rand(n,n)
D = np.random.rand(n,n)

What is the speed without GPU?

In [40]:
C_td = torch.DoubleTensor(C)
D_td = torch.DoubleTensor(D)

In [41]:
%%timeit
torch.matmul(C_td,D_td)

14.3 s ± 64.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


And on GPU?

In [42]:
C_td = C_td.to("cuda")
D_td = D_td.to("cuda")

In [43]:
%%timeit
torch.matmul(C_td,D_td)
torch.cuda.synchronize()

7.98 s ± 25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


How much GPU memory do these larger matrices take?

In [44]:
print('Memory allocated:', round(torch.cuda.memory_allocated()/1024**3,4), 'GB')
print('Memory reserved:   ', round(torch.cuda.memory_reserved()/1024**3,4), 'GB')

Memory allocated: 1.4911 GB
Memory reserved:    2.2402 GB
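
The allocated figure can be checked with a little arithmetic: a float64 element takes 8 bytes, and the two small 256x256 matrices moved earlier are assumed to still be on the GPU. A minimal sketch (the helper `tensor_gib` is illustrative; note that dividing by `1024**3` gives GiB, which the printout labels as GB):

```python
def tensor_gib(rows, cols, bytes_per_element):
    """Size of a dense rows x cols tensor in GiB (bytes / 1024**3)."""
    return rows * cols * bytes_per_element / 1024**3

large = 2 * tensor_gib(10000, 10000, 8)  # C_td and D_td, float64
small = 2 * tensor_gib(256, 256, 8)      # A_td and B_td, float64
total = large + small

print("large matrices:", round(large, 4), "GB")
print("small matrices:", round(small, 4), "GB")
print("total:         ", round(total, 4), "GB")
```

The total agrees with the `memory_allocated()` value reported above; the reserved figure is larger because PyTorch's caching allocator holds on to extra memory.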


Another way to time the GPU computation is with `torch.cuda.Event()`. Let's see how the timings compare.

In [45]:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
torch.matmul(C_td,D_td)
end.record()

# Waits for everything to finish running
torch.cuda.synchronize()

print(start.elapsed_time(end)/1000)

7.95766943359375


We get essentially the same result (7.96 s versus 7.98 s), so that is comforting!

Let's check memory now.

## Optimizations: Single and half precision, matrix sizing
Next we explore changing the data type and the matrix size. We will loop the computation to have it perform 1000 matrix multiplies and time that.

Next, single precision on the GPU.

In [46]:
C_ts = torch.cuda.FloatTensor(C)
D_ts = torch.cuda.FloatTensor(D)

In [47]:
print(C_ts.dtype)

torch.float32


Finally, half precision on the GPU.

In [48]:
C_th = torch.cuda.HalfTensor(C)
D_th = torch.cuda.HalfTensor(D)

In [49]:
print(C_th.dtype)

torch.float16


Let's try the computation now with these different precisions.

First, single precision on the GPU. Recall that a similar double-precision computation earlier was already ~20000X faster than pure Python.

In [50]:
%%timeit
torch.matmul(C_ts,D_ts)
torch.cuda.synchronize()

468 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The earlier double-precision calculation on the GPU took about 8 s, or 8000 ms, so this was a ~17X improvement.
Next, half precision on the GPU.

In [51]:
%%timeit
torch.matmul(C_th,D_th)
torch.cuda.synchronize()

108 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


So here we see a ~4X improvement over single precision (468 ms versus 108 ms).

With half precision, we are now ~75X faster than double precision on the GPU (7.98 s versus 108 ms).

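Multiplying this by the ~20000X from earlier gives a grand total on the order of 1,500,000X over the naive implementation of matrix multiply in pure Python! Keep in mind that this is an extrapolation, since the pure Python version was never run at this matrix size.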
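
The ratios quoted in this section can be recomputed from the reported timings. A minimal sketch, with the `%%timeit` results for `n = 10000` hard-coded (values from this particular T4 run; other hardware will differ, and the helper `ratio` is illustrative):

```python
# Seconds per 10000x10000 matrix multiply, copied from the outputs above
timings_s = {
    "CPU float64": 14.3,
    "GPU float64": 7.98,
    "GPU float32": 0.468,
    "GPU float16": 0.108,
}

def ratio(slower, faster):
    """How many times faster the `faster` run was than the `slower` run."""
    return timings_s[slower] / timings_s[faster]

gpu_vs_cpu = ratio("CPU float64", "GPU float64")
single_vs_double = ratio("GPU float64", "GPU float32")
half_vs_single = ratio("GPU float32", "GPU float16")
half_vs_double = ratio("GPU float64", "GPU float16")

print(f"GPU vs CPU (float64):    {gpu_vs_cpu:.1f}X")
print(f"float32 vs float64 (GPU): {single_vs_double:.1f}X")
print(f"float16 vs float32 (GPU): {half_vs_single:.1f}X")
print(f"float16 vs float64 (GPU): {half_vs_double:.1f}X")
```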