## DEMO Code CUDA Acceleration

We'll compare the execution time of `naive` and `optimized` (tiled) matrix multiplication on NxN matrices with N = 512, 1024, 2048, 4096, 8192, and 16384.


### Installs

In [70]:
# Installing NVCC
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc4jupyter

Collecting git+https://github.com/andreinechaev/nvcc4jupyter.git
  Cloning https://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-a6eyk6dn
  Running command git clone --filter=blob:none --quiet https://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-a6eyk6dn
  Resolved https://github.com/andreinechaev/nvcc4jupyter.git to commit 28f872a2f99a1b201bcd0db14fdbc5a496b9bfd7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
The nvcc4jupyter extension is already loaded. To reload it, use:
  %reload_ext nvcc4jupyter


In [71]:
# Py packages for visualization
import pandas as pd
import matplotlib.pyplot as plt
import io

In [86]:
!git clone https://github.com/mrzkp/llmsys_resources.git

%cd llmsys_resources

fatal: destination path 'llmsys_resources' already exists and is not an empty directory.
/content/llmsys_resources


### Compile

In [87]:
!nvcc -arch=sm_75 matmul_benchmark.cu -o matmul_benchmark -O3

### Visualize

In [None]:
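# Run the benchmark and keep only the CSV rows (size, naive_ms, tiled_ms)
output = !./matmul_benchmark
data = "\n".join(line for line in output if line.count(",") == 2)
print(data)
df = pd.read_csv(io.StringIO(data), names=['Size', 'Naive_ms', 'Tiled_ms'], header=None)


# Plot execution time against matrix dimension for both kernels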
plt.figure(figsize=(12, 7))
plt.plot(df['Size'], df['Naive_ms'], marker='o', label='Naive Kernel')
plt.plot(df['Size'], df['Tiled_ms'], marker='s', label='Tiled Kernel (32x32)')

plt.title('Naive vs. Tiled Performance', fontsize=14)
plt.xlabel('Matrix Dimension (N)', fontsize=12)
plt.ylabel('Exec Time (ms)', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

Naive,Tiled
512,0.886624,0.748896
1024,6.73814,5.32704
2048,50.5786,38.2878
4096,207.833,142.135
8192,1695.45,1132.1


### Sample performance comparison: Naive vs. Tiled

| Matrix Size ($N$) | Naive Time (ms) | Tiled Time (ms) | Speedup Ratio |
| :--- | :---: | :---: | :---: |
| **512** | 0.887 | 0.733 | 1.21x |
| **1,024** | 6.737 | 5.329 | 1.26x |
| **2,048** | 53.608 | 42.004 | 1.28x |
| **4,096** | 218.481 | 148.042 | 1.48x |
| **8,192** | 1,807.72 | 1,181.44 | 1.53x |
| **16,384** | 16,326.1 | 10,052.9 | 1.62x |
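
The speedup column is the naive time divided by the tiled time. A minimal sketch that recomputes it from the sample timings in the table (the names `SAMPLE_TIMINGS_MS` and `speedup` are illustrative and not part of the benchmark):

```python
# Sample timings copied from the table: N -> (naive_ms, tiled_ms).
SAMPLE_TIMINGS_MS = {
    512: (0.887, 0.733),
    1024: (6.737, 5.329),
    2048: (53.608, 42.004),
    4096: (218.481, 148.042),
    8192: (1807.72, 1181.44),
    16384: (16326.1, 10052.9),
}


def speedup(naive_ms, tiled_ms):
    """Return how many times faster the tiled kernel is than the naive one."""
    return naive_ms / tiled_ms


for n, (naive_ms, tiled_ms) in SAMPLE_TIMINGS_MS.items():
    print(f"N={n:>6}: {speedup(naive_ms, tiled_ms):.2f}x")
```

The same division can be applied to the `Naive_ms` and `Tiled_ms` columns of `df` to get the speedup for your own run.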

---

## Further Questions and Explanations


---


### Why does the speedup increase as N increases?
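As N increases, the amount of data grows as N^2 and the number of multiply-add operations grows as N^3, so the naive kernel issues rapidly more global memory reads. The tiled kernel stages data in shared memory and reuses it, so it needs fewer of those reads.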
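
To get a feel for the numbers, here is a rough count of global memory reads. It assumes textbook kernels, which may differ from the ones in `matmul_benchmark.cu`: the naive kernel reads a full row of A and a full column of B for every output element, and the tiled kernel reuses each loaded element `tile` times from shared memory. The tile width of 32 is taken from the plot legend, and the function names are illustrative.

```python
def naive_global_reads(n):
    """Element reads from global memory for a textbook naive kernel:
    each of the n*n outputs reads n elements of A and n elements of B."""
    return 2 * n**3


def tiled_global_reads(n, tile=32):
    """Element reads from global memory for a textbook shared-memory kernel:
    every element loaded into a tile is reused `tile` times.
    Assumes n is a multiple of tile."""
    return 2 * n**3 // tile


for n in (512, 1024, 2048, 4096, 8192, 16384):
    print(f"N={n:>6}: naive={naive_global_reads(n):.3e}  tiled={tiled_global_reads(n):.3e}")
```

In this simplified model the two counts differ by a constant factor of `tile` at every N, so the read count alone does not account for the measured trend. How many of those reads are absorbed by caches also matters.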

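Furthermore, depending on the GPU architecture, there is an L2 cache.

In our case, the T4 GPU has a 4 MB L2 cache.

We can solve for the largest N for which a single matrix fits in the L2 cache (assuming 4-byte floating-point numbers): 4 MB = N x N x 4 bytes, so N ~= 1024 (taking 1 MB = 2^20 bytes). While the matrices fit in L2, most of the naive kernel's repeated reads are served from the cache, and L2 has much higher bandwidth than main memory, so the naive kernel performs comparatively well against the optimized one. Once the matrices outgrow L2, more of those reads go to main memory and the tiled kernel's advantage grows.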
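
The capacity estimate can be reproduced for any cache size. A minimal sketch: `max_square_dim` is a hypothetical helper, and the cache sizes in the loop are illustrative inputs, so substitute the L2 size of your own GPU. It only accounts for a single matrix, while a multiplication touches three (A, B, and C), so treat the result as an optimistic upper bound.

```python
import math


def max_square_dim(cache_bytes, bytes_per_element=4):
    """Largest N such that one N x N matrix with the given element size
    fits in cache_bytes."""
    return math.isqrt(cache_bytes // bytes_per_element)


# Illustrative cache sizes, in bytes.
for label, cache_bytes in (("4 * 2**20", 4 * 2**20), ("6 * 10**6", 6 * 10**6)):
    print(f"{label} bytes -> N <= {max_square_dim(cache_bytes)}")
```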

---

### How would we get better speedup?

Similar to tiling, there is something called `register tiling`. This is, essentially, another level of tiling where we place our data at the register level instead of just using shared memory. It's like a `cache for the cache`.
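
To make the idea concrete, here is a plain-Python sketch (not CUDA, and not a homework solution) in which each simulated thread computes an `r x r` block of outputs and keeps the partial sums in local variables, which stand in for registers. The function name, the `r` parameter, and the read counter are illustrative assumptions, and N must be a multiple of `r`.

```python
def matmul_register_tiled(A, B, r=2):
    """Multiply square matrices A and B, computing r x r output blocks.

    Returns (C, reads), where reads counts element loads from the slower
    level (shared memory in the CUDA analogy).
    """
    n = len(A)
    assert n % r == 0, "n must be a multiple of r"
    C = [[0] * n for _ in range(n)]
    reads = 0
    for i0 in range(0, n, r):
        for j0 in range(0, n, r):
            # Accumulators for one r x r output block (the "registers").
            acc = [[0] * r for _ in range(r)]
            for k in range(n):
                # Load r values of A and r values of B once per k step...
                a = [A[i0 + i][k] for i in range(r)]
                b = [B[k][j0 + j] for j in range(r)]
                reads += 2 * r
                # ...and reuse them for r * r multiply-adds.
                for i in range(r):
                    for j in range(r):
                        acc[i][j] += a[i] * b[j]
            # Write the finished block back to the output matrix.
            for i in range(r):
                for j in range(r):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C, reads


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, reads = matmul_register_tiled(A, B, r=2)
print(C, reads)
```

With `r = 1` the loop degenerates to one output per thread and performs two loads per multiply-add. A larger `r` performs `2 * r` loads for `r * r` multiply-adds, which is where the saving comes from.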

Other techniques include double (or even triple) buffering, as used in libraries such as `cuBLAS`. These are much more advanced and aren't necessary for your homework assignments (but you are free to do your own research and try to implement them).