## Problem 1 [1p]:

Let's see why GPUs are useful in deep learning. Compare matrix multiplication speed for a few matrix shapes when implemented:
1. as loops in Python
2. using np.einsum
3. using numpy on CPU
4. using pytorch on CPU
5. using pytorch on GPU

Finally, consider two square matrices, $A$ and $B$. There are four ways to multiply them or their transposes:
1. $AB$
2. $A^TB$
3. $AB^T$
4. $A^TB^T$

Which option is the fastest? Why?
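
Before timing the four products, it helps to check what a transpose does in memory. The following is a minimal sketch, assuming NumPy's default row-major layout (`torch.t` likewise returns a view that shares storage with its input). The transpose copies no data and only swaps the strides, so any speed difference between the four variants would have to come from the memory access pattern inside the multiplication, not from the transposition itself.

```python
import numpy as np

# A small row-major (C-contiguous) matrix of 8-byte floats.
A = np.arange(6, dtype=np.float64).reshape(2, 3)
At = A.T  # a view on the same buffer, not a copy

print(A.strides)    # (24, 8): step 24 bytes per row, 8 bytes per column
print(At.strides)   # (8, 24): the same buffer, traversed column-first
print(np.shares_memory(A, At))
print(A.flags["C_CONTIGUOUS"], At.flags["C_CONTIGUOUS"], At.flags["F_CONTIGUOUS"])
```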

In [3]:
import torch
import numpy as np  # cpu 
import time



In [None]:
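def matrix_mult_loop(A, B):
    """Multiply A (n x k) by B (k x m) using explicit Python loops."""
    n, k = A.shape
    k_b, m = B.shape
    assert k == k_b, "inner dimensions must match"

    # Accumulate in the input dtype so the result matches torch.mm(A, B).
    C = torch.zeros((n, m), dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C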
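
The function above indexes torch tensors one element at a time, and every such access goes through tensor machinery, which probably accounts for a large part of the loop timing. As a point of comparison, here is a minimal sketch of the same triple loop over plain nested Python lists. `matmul_lists` is a hypothetical helper name, not part of the original notebook, and the small shapes are chosen only to keep the check fast.

```python
import numpy as np

def matmul_lists(A, B):
    """Triple-loop product of nested lists A (n x k) and B (k x m)."""
    n, k, m = len(A), len(B), len(B[0])
    assert len(A[0]) == k, "inner dimensions must match"
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0  # accumulate in a local Python int
            for l in range(k):
                s += A[i][l] * B[l][j]
            C[i][j] = s
    return C

A = np.random.randint(0, 10, size=(20, 15))
B = np.random.randint(0, 10, size=(15, 30))
C = matmul_lists(A.tolist(), B.tolist())

# The pure-Python result should agree with NumPy's matrix product.
print(np.array_equal(np.array(C), A @ B))
```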

In [None]:
n, k, m = 100, 70, 100
A = np.random.randint(0, 10, size=(n, k))
B = np.random.randint(0, 10, size=(k, m))
M1 = torch.tensor(A)
M2 = torch.tensor(B)

time_table = []
# 1. Explicit Python loops over torch tensors
t = time.process_time()
C_loop = matrix_mult_loop(M1, M2)
time_table.append(time.process_time() - t)
# 2. np.einsum
t = time.process_time()
C_einsum = np.einsum('ij,jk->ik', A, B)
time_table.append(time.process_time() - t)
# 3. NumPy on CPU
t = time.process_time()
C_np_cpu = np.matmul(A, B)
time_table.append(time.process_time() - t)
# 4. PyTorch on CPU
t = time.process_time()
C_torch = torch.mm(M1, M2)
time_table.append(time.process_time() - t)
# 5. PyTorch on GPU (float inputs; skipped when no CUDA device is available)
if torch.cuda.is_available():
    x = torch.randn(n, k, device="cuda")
    w = torch.randn(k, m, device="cuda")
    torch.cuda.synchronize()  # finish setup before starting the clock
    t = time.perf_counter()   # wall-clock time, since the work runs off the CPU
    y = x.mm(w)               # (n, k) @ (k, m)
    torch.cuda.synchronize()  # wait for the kernel to finish
    time_table.append(time.perf_counter() - t)

print(time_table)

| Type      | Time (s)   |
|-----------|------------|
| loop      | 73.335327  |
| np.einsum | 0.0020003  |
| np CPU    | 0.0019050  |
| torch CPU | 0.00401183 |
In [5]:
n, k, m = 100, 100, 100
A = np.random.randint(0, 10, size=(n, k))
B = np.random.randint(0, 10, size=(k, m))
M1 = torch.tensor(A)
M2 = torch.tensor(B)

time_table = []
# A B
t = time.process_time()
C_torch = torch.mm(M1, M2)
time_table.append(time.process_time() - t)
# A^T B (torch.t returns a view, so no data is copied here)
t = time.process_time()
C_torch = torch.mm(torch.t(M1), M2)
time_table.append(time.process_time() - t)
# A B^T
t = time.process_time()
C_torch = torch.mm(M1, torch.t(M2))
time_table.append(time.process_time() - t)
# A^T B^T
t = time.process_time()
C_torch = torch.mm(torch.t(M1), torch.t(M2))
time_table.append(time.process_time() - t)

print(time_table)

[0.0027436089999994806, 0.0030644999999989153, 0.001939645000000212, 0.0020961879999994437]


| Product    | Time (s) |
|------------|----------|
| $AB$       | 0.002743 |
| $A^TB$     | 0.003064 |
| $AB^T$     | 0.001939 |
| $A^TB^T$   | 0.002096 |
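
Single measurements of a few milliseconds are easily dominated by noise, so the ranking in the table may not be stable across runs. Below is a hedged sketch of a more repeatable comparison, using `timeit.repeat` and taking the minimum over several batches. It uses NumPy float matrices instead of the integer torch tensors above, so the absolute numbers are not comparable to the table.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.random((n, n))
B = rng.random((n, n))

# The four products; .T is a view, so no copy happens before the multiply.
variants = {
    "AB": lambda: A @ B,
    "A^T B": lambda: A.T @ B,
    "A B^T": lambda: A @ B.T,
    "A^T B^T": lambda: A.T @ B.T,
}

calls = 200
results = {}
for name, fn in variants.items():
    # Five batches of `calls` runs each; keep the least disturbed batch.
    best = min(timeit.repeat(fn, number=calls, repeat=5))
    results[name] = best / calls

for name, seconds in results.items():
    print(f"{name:8s} {seconds * 1e6:8.1f} us per call")
```

Running the loop several times and checking whether the ordering persists is a reasonable way to decide if a difference between variants is real.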