# Matrix Addition
Matrix addition is an element-wise operation, which makes it highly parallelizable: each element of
the resulting matrix can be computed independently of the others. In Python, we can parallelize this
operation with the `concurrent.futures` module, `multiprocessing`, or GPU acceleration via CUDA.

**Example Code for Sequential Matrix Addition**: Before diving into parallelism, let’s look at how matrix addition is implemented sequentially:

In [1]:
import numpy as np

# Define two matrices
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

C = A + B
print(C)

[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]
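
To make explicit what `A + B` computes, here is a minimal sketch of the same element-wise addition written with plain Python lists and loops. The function name `add_matrices` is an illustrative choice, not part of NumPy. Each `C[i][j]` depends only on `A[i][j]` and `B[i][j]`, which is the independence that the parallel versions rely on.

```python
def add_matrices(A, B):
    """Element-wise sum of two equally sized matrices given as lists of rows."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same shape")
    # Each output element depends only on the matching input elements.
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
C = add_matrices(A, B)
print(C)  # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```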


## Parallel Matrix Addition Using Threads
To parallelize matrix addition, we can divide the matrix into rows or blocks and assign each portion
to a separate thread for computation. Here’s how you can do it row by row using the
`ThreadPoolExecutor` from the `concurrent.futures` module:

In [2]:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Add one pair of corresponding rows
def add_rows(row_a, row_b):
    return row_a + row_b

# Matrix addition with one task per row, distributed across threads
def parallel_matrix_addition(A, B):
    with ThreadPoolExecutor() as executor:
        result = list(executor.map(add_rows, A, B))
    return np.array(result)

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Perform parallel matrix addition
C = parallel_matrix_addition(A, B)
print(C)

[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]
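
The text above mentions dividing the matrix into blocks as well as rows. A minimal sketch of the block variant follows, using plain Python lists so that it runs without extra dependencies; `chunk_rows`, `add_chunk`, `blockwise_matrix_addition`, and the `num_workers` parameter are hypothetical names introduced here. For pure-Python arithmetic, CPython's global interpreter lock keeps threads from running this work truly in parallel, so the sketch illustrates the partitioning, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_rows(n_rows, num_workers):
    """Split range(n_rows) into at most num_workers contiguous (start, stop) slices."""
    size = max(1, -(-n_rows // num_workers))  # ceiling division, at least 1
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]


def add_chunk(A, B, start, stop):
    """Add rows start..stop-1 of A and B element by element."""
    return [[a + b for a, b in zip(ra, rb)]
            for ra, rb in zip(A[start:stop], B[start:stop])]


def blockwise_matrix_addition(A, B, num_workers=2):
    bounds = chunk_rows(len(A), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map preserves input order, so chunks come back in row order.
        chunks = executor.map(lambda b: add_chunk(A, B, *b), bounds)
        return [row for chunk in chunks for row in chunk]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
C = blockwise_matrix_addition(A, B)
print(C)  # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```

With three rows and two workers, the first worker receives rows 0 and 1 and the second receives row 2.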


## Matrix Addition Using CUDA
We can further optimize matrix addition by leveraging CUDA for GPU
acceleration. CUDA enables the use of GPUs to perform parallel matrix operations on a massive scale,
making it much faster for large matrices. Below is an example of using CUDA for matrix addition:

In [15]:
import numpy as np
from numba import cuda

# Define the matrix size
N = 3

# CUDA kernel for matrix addition: one thread per output element
@cuda.jit
def matrix_addition(A, B, C):
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        C[i, j] = A[i, j] + B[i, j]

# Initialize matrices
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
C = np.zeros((N, N), dtype=np.float32)

# Define grid and block sizes
threads_per_block = (16, 16)
blocks_per_grid_x = int(np.ceil(A.shape[0] / threads_per_block[0]))  # 1
blocks_per_grid_y = int(np.ceil(A.shape[1] / threads_per_block[1]))  # 1
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)  # (1, 1)

# Launch the CUDA kernel
matrix_addition[blocks_per_grid, threads_per_block](A, B, C)

print(C)

[[ 2.  4.  6.]
 [ 8. 10. 12.]
 [14. 16. 18.]]
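
The grid-size arithmetic in the launch configuration above generalizes to any matrix shape. As a small sketch, the hypothetical helper `blocks_per_grid_for` below performs the same ceiling division with integers instead of `np.ceil`, so that every matrix element is covered by at least one thread.

```python
def blocks_per_grid_for(shape, threads_per_block):
    """Number of blocks needed along each axis to cover `shape`."""
    # (n + t - 1) // t is integer ceiling division: the smallest block count
    # whose threads cover n elements along that axis.
    return tuple((n + t - 1) // t for n, t in zip(shape, threads_per_block))


print(blocks_per_grid_for((3, 3), (16, 16)))       # (1, 1), as in the example above
print(blocks_per_grid_for((1000, 500), (16, 16)))  # (63, 32)
```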




### Thread Grid Visualization
We have:
* 1 block containing 16×16 threads (256 total)
* 3×3 matrix (9 elements)
```
Thread Block (16×16 threads) - Only top-left 3×3 are used:
Thread Coordinates (i,j) within the block:
(0,0) (0,1) (0,2) (0,3) ... (0,15)
(1,0) (1,1) (1,2) (1,3) ... (1,15)
(2,0) (2,1) (2,2) (2,3) ... (2,15)
(3,0) (3,1) (3,2) (3,3) ... (3,15)
 ...   ...   ...   ...  ...   ...
(15,0)(15,1)(15,2)(15,3)... (15,15)
```
* Matrix mapping: each thread gets its global coordinates via `cuda.grid(2)`. Since we have only 1 block, the mapping is direct:
    * i = thread_x (0 to 15)
    * j = thread_y (0 to 15)
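
For launches with more than one block, `cuda.grid(2)` combines the block and thread indices. The following CPU-only sketch mirrors that arithmetic (`blockIdx * blockDim + threadIdx` along each axis); the function `global_index` is a hypothetical stand-in for illustration and does not call Numba.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Global (i, j) of a thread, mirroring what cuda.grid(2) returns in a kernel."""
    bx, by = block_idx
    dx, dy = block_dim
    tx, ty = thread_idx
    return (bx * dx + tx, by * dy + ty)


# Single block, as in the 3×3 example: the global index equals the thread index.
print(global_index((0, 0), (16, 16), (2, 1)))  # (2, 1)
# Second block along x in a larger launch.
print(global_index((1, 0), (16, 16), (2, 1)))  # (18, 1)
```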

### Which Threads Actually Work?
* **Working threads (9 threads):**
```
Matrix C indices ←→ Thread coordinates
C[0,0] ← Thread (0,0) → 1+1=2
C[0,1] ← Thread (0,1) → 2+2=4
C[0,2] ← Thread (0,2) → 3+3=6
C[1,0] ← Thread (1,0) → 4+4=8
C[1,1] ← Thread (1,1) → 5+5=10
C[1,2] ← Thread (1,2) → 6+6=12
C[2,0] ← Thread (2,0) → 7+7=14
C[2,1] ← Thread (2,1) → 8+8=16
C[2,2] ← Thread (2,2) → 9+9=18
```
* **Non-working threads (247 threads):**
    * Threads with i ≥ 3 OR j ≥ 3 skip the computation
    * Examples: (0,3), (3,0), (15,15); all fail the boundary check and exit without writing to C
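
The boundary check can be emulated on the CPU by visiting every thread coordinate of the block in a loop. This is only a sketch of the kernel's logic (the real threads run concurrently on the GPU), and `emulate_block` is a hypothetical helper.

```python
def emulate_block(A, B, block_dim=(16, 16)):
    """Run the kernel body once per thread coordinate of a single block."""
    rows, cols = len(A), len(A[0])
    C = [[0] * cols for _ in range(rows)]
    working = 0
    for i in range(block_dim[0]):
        for j in range(block_dim[1]):
            if i < rows and j < cols:  # same boundary check as the kernel
                C[i][j] = A[i][j] + B[i][j]
                working += 1
    idle = block_dim[0] * block_dim[1] - working
    return C, working, idle


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
C, working, idle = emulate_block(A, B)
print(working, idle)  # 9 247
```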

### Visual Map of Active Threads
```
Active Threads/Matrix Elements:
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
 ▫  ▫  ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
 ... 192 more inactive threads (rows 4 to 15) ...
```
where 🟩 = Active thread (computes one matrix element), ▫ = Inactive thread (skips computation due to boundary check)
* **Execution pattern:**
    * The GPU launches 256 threads in parallel
    * 9 threads find they have valid matrix indices (0-2, 0-2) and perform the addition
    * 247 threads exit immediately because their coordinates are outside the matrix bounds
    * The 256 threads are scheduled as 8 warps of 32 threads each, and the kernel finishes once every warp has finished
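
To see how the nine working threads fall into warps, the sketch below linearizes thread coordinates the way CUDA does within a block (the x index varies fastest) and groups consecutive IDs into warps of 32. It assumes that `i` from `cuda.grid(2)` corresponds to the x index, as in the mapping above; `warp_of` is a hypothetical helper.

```python
WARP_SIZE = 32
BLOCK_DIM = (16, 16)  # (x, y) threads per block


def warp_of(tx, ty, block_dim=BLOCK_DIM):
    """Warp number of thread (tx, ty): the linear ID is tx + ty * blockDim.x."""
    return (tx + ty * block_dim[0]) // WARP_SIZE


total_warps = BLOCK_DIM[0] * BLOCK_DIM[1] // WARP_SIZE
# Working threads have i (x) and j (y) in 0..2.
active_warps = sorted({warp_of(i, j) for i in range(3) for j in range(3)})
print(total_warps)   # 8
print(active_warps)  # [0, 1]
```

Under this assumption, only two of the eight warps contain any working thread; the other six consist entirely of threads that exit after the boundary check.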

If we had a 16×16 matrix instead, all 256 threads would be utilized: a perfect mapping of one thread per matrix element, with no wasted computation.
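
The fraction of launched threads that do useful work can be estimated for any matrix size. A rough sketch, where `thread_utilization` is a hypothetical helper and the launch is assumed to use the same ceiling-division grid as the 3×3 example:

```python
def thread_utilization(rows, cols, threads_per_block=(16, 16)):
    """Fraction of launched threads that pass the boundary check."""
    tx, ty = threads_per_block
    blocks_x = (rows + tx - 1) // tx  # ceiling division
    blocks_y = (cols + ty - 1) // ty
    launched = blocks_x * tx * blocks_y * ty
    return rows * cols / launched


print(thread_utilization(3, 3))    # 0.03515625  (9 of 256 threads)
print(thread_utilization(16, 16))  # 1.0         (256 of 256 threads)
print(thread_utilization(17, 17))  # 0.2822265625 (289 of 1024 threads)
```

The 17×17 case shows that exceeding a block boundary by a single row and column already requires four blocks, most of whose threads stay idle.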

### Performance Warnings
Running the kernel above makes Numba emit performance warnings. These are not errors: the code still runs correctly, and the warnings are suggestions for improving GPU performance.

### Warning 1: Low Occupancy
`Grid size 1 will likely result in GPU under-utilization due to low occupancy.`
