# Matrix Multiplication: Naive, Optimized, and CUDA Approaches

Matrix multiplication is a more complex operation than matrix addition. Given two matrices A of dimensions m × n and B of dimensions n × p, their product C has dimensions m × p, and each entry is computed as:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
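
As a worked instance of this definition, take the matrices used in the first example below, A = [[1, 2, 3], [4, 5, 6]] and B = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]. The entry in the first row and first column of C pairs row 1 of A with column 1 of B (indices here are 1-based, unlike the 0-based Python code):

$$C_{11} = \sum_{k=1}^{3} A_{1k} B_{k1} = 1 \cdot 7 + 2 \cdot 10 + 3 \cdot 13 = 66$$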

### Naive Implementation of Matrix Multiplication

In [1]:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[7, 8, 9], [10, 11, 12], [13, 14, 15]])

# Naive Matrix Multiplication

C = np.zeros((A.shape[0], B.shape[1]))

for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        for k in range(A.shape[1]):
            C[i, j] += A[i, k] * B[k, j]  # accumulate the sum over k

print(C)


[[ 66.  72.  78.]
 [156. 171. 186.]]


### Naive Matrix Multiplication
The naive approach to matrix multiplication uses three nested loops: one over the rows of matrix A,
one over the columns of matrix B, and one over the shared dimension n that accumulates the products
A[i, k] * B[k, j] into C[i, j]. The same logic can be wrapped in a reusable Python function:

In [3]:
import numpy as np

def naive_matrix_multiplication(A, B):
    m, n = A.shape
    n_b, p = B.shape
    # The inner dimensions must agree for the product to be defined
    if n != n_b:
        raise ValueError("A.shape[1] must equal B.shape[0]")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]  # accumulate the sum over k
    return C

# Define two matrices
A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])

C = naive_matrix_multiplication(A, B)
print(C)

[[27. 30.]
 [78. 87.]]
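
One way to gain confidence in a hand-written triple loop is to compare it with NumPy's built-in product on random inputs. The sketch below is a self-contained check under a few assumptions: `naive_matmul` is a hypothetical stand-alone version of the loop that accumulates the products with `+=`, and the matrix sizes and random seed are arbitrary choices.

```python
import numpy as np

def naive_matmul(A, B):
    """Triple-loop matrix product (hypothetical helper for this check)."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]  # accumulate over k
    return C

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Compare against NumPy's own product, allowing for rounding differences
matches = np.allclose(naive_matmul(A, B), A @ B)
print(matches)
```

If the loop accumulates correctly, this should print `True`. Passing matrices whose inner dimensions differ raises a `ValueError` instead of silently producing a wrong result.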


### Optimized Matrix Multiplication
The naive matrix multiplication is inefficient for two reasons: every inner-loop iteration runs through the Python interpreter, and the loop repeatedly reads the same data from memory with poor locality, which exposes memory latency. Optimized approaches reuse data while it sits in fast memory (the cache on a CPU, shared memory on a GPU) and rely on matrix libraries like NumPy, which are highly optimized and make use of BLAS (Basic Linear Algebra Subprograms).

### Using NumPy for Optimized Multiplication
NumPy’s dot function (np.dot) is a highly optimized implementation of matrix multiplication. For 2-D arrays it computes the same product as the @ operator.

In [4]:
import numpy as np

A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])

C = np.dot(A, B)
print(C)

[[27 30]
 [78 87]]
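
To get a rough feel for the gap between the two approaches, the sketch below times a triple loop against NumPy on the same inputs. It rests on a few assumptions: `naive_matmul` is a hypothetical helper written for this comparison, the size `n = 60` is chosen only so the pure-Python loop finishes quickly, and a single `time.perf_counter` measurement is a crude estimate rather than a proper benchmark. Timings depend on the machine, so no expected numbers are shown.

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Triple-loop product (hypothetical helper for this comparison)."""
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

n = 60                               # small, so the Python loop stays fast
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

start = time.perf_counter()
C_naive = naive_matmul(A, B)
t_naive = time.perf_counter() - start

start = time.perf_counter()
C_numpy = np.dot(A, B)
t_numpy = time.perf_counter() - start

print(f"naive loop: {t_naive:.4f} s, np.dot: {t_numpy:.6f} s")
print(np.allclose(C_naive, C_numpy))  # both methods should agree
```

For more reliable numbers, the standard-library `timeit` module repeats the measurement and reduces noise.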


NumPy uses highly optimized libraries under the hood (like OpenBLAS or Intel MKL) to perform
matrix multiplication efficiently, taking advantage of low-level optimizations such as data pre-fetching
and cache reuse.
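
The idea behind cache reuse can be illustrated with a blocked (tiled) multiplication, which works on small sub-matrices so that each tile is used several times while it is still in fast memory. The sketch below is only an illustration of that idea: `blocked_matmul` and its `block` parameter are hypothetical names, the block size of 2 is arbitrary, and real BLAS kernels are written in low-level code with carefully tuned tile sizes. In Python this version is not expected to be faster than `np.dot`.

```python
import numpy as np

def blocked_matmul(A, B, block=2):
    """Tiled matrix product; `block` is an illustrative tile size."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, p), dtype=np.result_type(A, B))
    for i0 in range(0, m, block):          # tile rows of C
        for j0 in range(0, p, block):      # tile columns of C
            for k0 in range(0, n, block):  # tiles along the shared dimension
                # Slices past the array edge are truncated by NumPy,
                # so sizes that are not multiples of `block` still work.
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, k0:k0 + block]
                    @ B[k0:k0 + block, j0:j0 + block]
                )
    return C

A = np.arange(1, 13).reshape(3, 4)
B = np.arange(1, 21).reshape(4, 5)

same = np.array_equal(blocked_matmul(A, B), A @ B)
print(same)
```

Each tile of C is the sum of products of matching tiles of A and B, which is the same sum over k as in the naive loop, only grouped differently.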

## Parallelizing Matrix Multiplication
To parallelize matrix multiplication manually, we can split the work across threads. Each entry of C depends only on one row of matrix A and one column of matrix B, so different rows (or columns) of the result can be computed independently. However, using optimized libraries like NumPy is often the best approach, as they already include multi-threading and SIMD (Single Instruction, Multiple Data) optimizations.
Here’s an example of manually parallelizing matrix multiplication, with one task per row of A:

In [1]:
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Function to compute one row of matrix C
def multiply_row(A_row, B):
    return np.dot(A_row, B)

# Parallel Matrix Multiplication
def parallel_matrix_multiplication(A, B):
    m = A.shape[0]

    # Each task multiplies one row of A by the whole of B
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(multiply_row, A, [B] * m))

    # Stack the per-row results back into an m x p matrix
    return np.array(results)

A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])

C = parallel_matrix_multiplication(A, B)
print(C)

[[27 30]
 [78 87]]
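
Submitting one task per row creates a lot of scheduling overhead for large matrices. A common variation, sketched below, gives each thread a contiguous block of rows instead. The names `parallel_matmul_blocks` and `n_workers` are hypothetical, and the default of two workers is an arbitrary assumption. Whether threads bring any speedup depends on how much work each task does inside NumPy, so this is a sketch of the partitioning rather than a tuned implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_matmul_blocks(A, B, n_workers=2):
    """Multiply A by B, giving each worker a contiguous block of A's rows."""
    # array_split tolerates row counts that do not divide evenly
    row_blocks = np.array_split(A, n_workers, axis=0)

    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # executor.map preserves the order of the blocks
        partial = list(executor.map(lambda block: block @ B, row_blocks))

    # Stack the partial results back into the full m x p matrix
    return np.vstack(partial)

A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])

C = parallel_matmul_blocks(A, B)
print(C)
# [[27 30]
#  [78 87]]
```

Because `executor.map` returns results in submission order, the stacked blocks line up with the rows of A without any extra bookkeeping.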
