<a href="https://colab.research.google.com/github/m-kelly444/cs5356-hw3/blob/main/2_conv1d_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1D Convolution on GPU

## 1. Set-up

In [15]:
# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [16]:
# Make sure your token is stored in a txt file at the location below.
# This way there is no risk that you will push it to your repo
# Never share your token with anyone, it is basically your github password!
with open('/content/gdrive/MyDrive/ece5545/token.txt') as f:
    token = f.readline().strip()
# Use another file to store your github username
with open('/content/gdrive/MyDrive/ece5545/git_username.txt') as f:
    handle = f.readline().strip()

In [17]:
# Clone your github repo
YOUR_TOKEN = token
YOUR_HANDLE = handle
BRANCH = "main"

!mkdir -p /content/gdrive/MyDrive/ece5545
# %cd /content/gdrive/MyDrive/ece5545
# !git clone https://{YOUR_TOKEN}@github.com/ML-HW-SYS/a3-{YOUR_HANDLE}.git
%cd /content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}
# !git checkout {BRANCH}
# !git pull
# %cd /content/gdrive/MyDrive/ece5545

PROJECT_ROOT = f"/content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}"

/content/gdrive/MyDrive/ece5545/a3-m-kelly444


In [18]:
# This extension reloads all imports before running each cell
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [19]:
!ls {PROJECT_ROOT}

 1-conv1d_cpu.ipynb    5-conv2d_dw_gpu.ipynb	     src
 2-conv1d_gpu.ipynb   'Copy of 1-conv1d_cpu.ipynb'   tests
 3-conv1d_fpga.ipynb  'Copy of 2-conv1d_gpu.ipynb'
 4-gemm_gpu.ipynb      README.md


## 2. Install TVM

In [20]:
!pip install tlcpack-nightly-cu102 -f https://tlcpack.ai/wheels
!pip install numpy==1.24.4

Looking in links: https://tlcpack.ai/wheels


## 3. Implement the `make_conv1d_gpu_scheduler` function in `src.ops`

In that function, you are required to implement 1D convolution and use TVM to optimize it.
Let $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, then
$$
\operatorname{conv1d}(x, y)_i = \sum_{j=-\infty}^{\infty} x[j]\,y[i-j], \quad \forall i \in \{0, 1, \dots, m + n - 2\}
$$

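Use zero padding and unit stride, so that the output has $m + n - 1$ elements and out-of-range entries of $x$ and $y$ count as zero. This matches the default `full` mode of the NumPy convolution function: [link](https://numpy.org/doc/stable/reference/generated/numpy.convolve.html).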
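To make the definition concrete, here is a minimal NumPy sketch that evaluates the sum directly, treating out-of-range entries as zero. The name `conv1d_reference` and the small test arrays are illustrative assumptions, not part of the assignment code.

```python
import numpy as np


def conv1d_reference(x, y):
    """Direct evaluation of the convolution sum (hypothetical helper)."""
    m, n = len(x), len(y)
    out = np.zeros(m + n - 1, dtype=np.result_type(x, y))
    for i in range(m + n - 1):
        for j in range(m):
            k = i - j
            # y[k] is treated as zero outside 0 <= k < n (zero padding)
            if 0 <= k < n:
                out[i] += x[j] * y[k]
    return out


x_small = np.array([1.0, 2.0, 3.0])
y_small = np.array([0.0, 1.0, 0.5])
ref_out = conv1d_reference(x_small, y_small)
print(ref_out)
print(np.convolve(x_small, y_small))  # should agree with the line above
```

The double loop is far too slow for real sizes; it is only meant as a readable specification to compare an optimized implementation against.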

The `make_conv1d_gpu_scheduler` function takes $m$ and $n$, which are the sizes of the two 1D input arrays.
You should return both the TVM scheduler and the TVM operators for
1. Input $x$
2. Input $y$
3. Output $out$

The scheduler should be usable to build a function with signature $func(x, y, out)$.
Please see the following cells for usage.
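The $func(x, y, out)$ signature means the caller allocates the output and the function fills it in place. A NumPy stand-in with the same calling convention is sketched below. It pads $x$ with $n - 1$ zeros on each side so every output element reads a full window of length $n$, which is one common way to avoid boundary checks in the inner loop. The name `conv1d_padded` is hypothetical, and this is not the required TVM implementation.

```python
import numpy as np


def conv1d_padded(x, y, out):
    """Fill `out` in place with the full convolution of x and y (sketch)."""
    m, n = x.shape[0], y.shape[0]
    assert out.shape[0] == m + n - 1, "out must have m + n - 1 elements"
    # Zero-pad x with n - 1 elements on each side.
    padded = np.zeros(m + 2 * (n - 1), dtype=x.dtype)
    padded[n - 1 : n - 1 + m] = x
    y_rev = y[::-1]
    for i in range(m + n - 1):
        # out[i] = sum_k padded[i + (n - 1) - k] * y[k]
        out[i] = np.dot(padded[i : i + n], y_rev)


x_demo = np.arange(1, 9, dtype="float32")
y_demo = np.array([1.0, -1.0, 0.5], dtype="float32")
out_demo = np.zeros(x_demo.shape[0] + y_demo.shape[0] - 1, dtype="float32")
conv1d_padded(x_demo, y_demo, out_demo)
print(np.allclose(out_demo, np.convolve(x_demo, y_demo)))
```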

In [25]:
import sys

import numpy as np
import tvm

# Add the assignment 3 repo to the system path.
# Make sure this matches your git directory.
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_conv1d_gpu_scheduler

M = 16384
N = 32
dtype = 'float32'
a_np = np.random.rand(M).astype(dtype)
w_np = np.random.rand(N).astype(dtype)
b_np = np.convolve(a_np, w_np)

s, A, W, B = make_conv1d_gpu_scheduler(M, N)
func = tvm.build(s, [A, W, B], "cuda")

dev = tvm.cuda(0)
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((M+N-1), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Answer:", b_np)
print("Output:", b)
print(f"1DConv TVM: {evaluator(a, w, b).mean * 1e3:f} ms")

Answer: [0.26408884 0.072424   0.39652804 ... 0.91002446 0.2956373  0.02946377]
Output: [0.26408884 0.072424   0.39652804 ... 0.9100244  0.2956373  0.02946377]
1DConv TVM: 0.017247 ms
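The printed arrays agree only up to float32 rounding (for example `0.91002446` versus `0.9100244`), so comparing by eye or with exact equality is fragile. A tolerance-based check is sketched below. The helper name `check_conv1d`, the tolerances, and the synthetic "device" result are assumptions for illustration. In the notebook one would presumably pass `b_np` and the device array converted to NumPy (for example via `b.numpy()`, assuming the installed TVM version provides it).

```python
import numpy as np


def check_conv1d(expected, actual, rtol=1e-4, atol=1e-6):
    """Raise AssertionError if the arrays differ beyond tolerance; return max abs error."""
    np.testing.assert_allclose(actual, expected, rtol=rtol, atol=atol)
    return float(np.max(np.abs(actual - expected)))


rng = np.random.default_rng(0)
a_demo = rng.random(1024).astype("float32")
w_demo = rng.random(32).astype("float32")
expected = np.convolve(a_demo, w_demo)
# Stand-in for a GPU result: accumulate in float64, then cast back to float32,
# which differs from the float32 reference only by rounding.
actual = np.convolve(a_demo.astype("float64"), w_demo.astype("float64")).astype("float32")
max_err = check_conv1d(expected, actual)
print(f"max abs error: {max_err:.3e}")
```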


In [22]:
print(tvm.lower(s, [A, W, B], simple_mode=True))

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((16384,), "float32"), W: T.Buffer((32,), "float32"), B: T.Buffer((16415,), "float32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        padded_A = T.allocate([16446], "float32", "global")
        padded_A_1 = T.Buffer((16446,), data=padded_A)
        with T.launch_thread("blockIdx.y", 257) as blockIdx_y:
            threadIdx_y = T.launch_thread("threadIdx.y", 64)
            if T.likely(blockIdx_y * 32 + threadIdx_y // 2 < 8223):
                padded_A_1[blockIdx_y * 64 + threadIdx_y] = T.if_then_else(blockIdx_y * 64 + threadIdx_y < 31 or 16415 <= blockIdx_y * 64 + threadIdx_y, T.float32(0), A[blockIdx_y * 64 + threadIdx_y - 31])
        blockIdx_x = T.launch_thread("blockIdx.x", 257)
        threadIdx_x = T.launch_thread("threadIdx.x", 64)
        if T.likely(blockIdx_x * 64 + threadIdx_x < 16415
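The constants in the lowered IR above can be reproduced from `M`, `N`, and the 64 threads per block that appear in the `threadIdx` bindings. The short calculation below is a reading of that output, assuming the padded buffer holds `N - 1` zeros on each side of `A`; the variable names are illustrative.

```python
M, N = 16384, 32
THREADS = 64  # threads per block, as seen in the threadIdx bindings


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


pad = N - 1                 # zeros on each side of A (offset 31 in the IR)
out_len = M + N - 1         # length of B
padded_len = M + 2 * pad    # length of padded_A
blocks_pad = ceil_div(padded_len, THREADS)  # blockIdx.y extent
blocks_out = ceil_div(out_len, THREADS)     # blockIdx.x extent

print(pad, out_len, padded_len, blocks_pad, blocks_out)
# → 31 16415 16446 257 257
```

Because 257 blocks of 64 threads cover 16448 indices, slightly more than either buffer needs, the `T.likely(...)` guards mask off the few threads that fall past the end.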

In [23]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_1dconv_gpu.py

/content/gdrive/MyDrive/ece5545/a3-m-kelly444
platform linux -- Python 3.11.12, pytest-8.3.5, pluggy-1.5.0
rootdir: /content/gdrive/MyDrive/ece5545/a3-m-kelly444
plugins: langsmith-0.3.24, anyio-4.9.0, typeguard-4.4.2
collected 15 items

tests/test_1dconv_gpu.py ...............                                 [100%]

