# Depthwise-separable 2D Convolution on GPU

## 1. Set-up

In [1]:
# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
# Make sure your token is stored in a txt file at the location below.
# This way there is no risk that you will push it to your repo
# Never share your token with anyone, it is basically your github password!
with open('/content/gdrive/MyDrive/ece5545/token.txt') as f:
    token = f.readline().strip()
# Use another file to store your github username
with open('/content/gdrive/MyDrive/ece5545/git_username.txt') as f:
    handle = f.readline().strip()

In [3]:
# Clone your github repo
YOUR_TOKEN = token
YOUR_HANDLE = handle
BRANCH = "main"

%mkdir /content/gdrive/MyDrive/ece5545
%cd /content/gdrive/MyDrive/ece5545
!git clone https://{YOUR_TOKEN}@github.com/ML-HW-SYS/a3-{YOUR_HANDLE}.git
%cd /content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}
!git checkout {BRANCH}
!git pull
%cd /content/gdrive/MyDrive/ece5545

PROJECT_ROOT = f"/content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}"

mkdir: cannot create directory ‘/content/gdrive/MyDrive/ece5545’: File exists
/content/gdrive/MyDrive/ece5545
fatal: destination path 'a3-kpan02' already exists and is not an empty directory.
/content/gdrive/MyDrive/ece5545/a3-kpan02
M	src/ops.py
Already on 'main'
Your branch is up to date with 'origin/main'.
Already up to date.
/content/gdrive/MyDrive/ece5545


In [4]:
# This extension reloads all imports before running each cell
%load_ext autoreload
%autoreload 2

In [5]:
!ls {PROJECT_ROOT}

1-conv1d_cpu.ipynb   4-gemm_gpu.ipynb	    README.md
2-conv1d_gpu.ipynb   5-conv2d_dw_gpu.ipynb  src
3-conv1d_fpga.ipynb  leaderboard_id.txt     tests


## 2. Install TVM

In [6]:
!pip install numpy==1.24.3
!pip install tlcpack-nightly-cu102 -f https://tlcpack.ai/wheels

Looking in links: https://tlcpack.ai/wheels


## 3. Implement `make_dwsp_conv2d_gpu_scheduler` function in `src.ops`

In that function, you are required to implement a depthwise 2D convolution and use TVM to optimize it.
Please use zero padding and unit stride.
You can assume the kernel size is an odd number.
The total padding along each spatial dimension equals the kernel size minus one, i.e. $(K-1)/2$ on each side.
In this case, the output image preserves the input image dimensions.
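The padding rule can be checked with a few lines of NumPy. This is only an illustrative sketch: the helper names `same_padding` and `output_size` are hypothetical and not part of the assignment code, and it assumes the padding is split evenly between the two sides of each spatial dimension.

```python
import numpy as np

def same_padding(kernel_size: int) -> int:
    """Per-side zero padding that keeps H and W unchanged at unit stride (hypothetical helper)."""
    assert kernel_size % 2 == 1, "the kernel size is assumed to be odd"
    return (kernel_size - 1) // 2

def output_size(in_size: int, kernel_size: int, pad: int, stride: int = 1) -> int:
    """Standard convolution output-size formula along one dimension."""
    return (in_size + 2 * pad - kernel_size) // stride + 1

H, W, K = 16, 32, 7
p = same_padding(K)

x = np.ones((H, W), dtype="float32")
# np.pad defaults to mode="constant" with value 0, i.e. zero padding.
x_padded = np.pad(x, ((p, p), (p, p)))

print(p, x_padded.shape, output_size(H, K, p), output_size(W, K, p))
# 3 (22, 38) 16 32
```

With $K = 7$, each side is padded by 3, so the total padding per dimension is $K - 1 = 6$ and the output keeps the 16 x 32 input size.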

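The `make_dwsp_conv2d_gpu_scheduler` function takes the following arguments:
1. Batch size $B$;
2. Input channel size $C$;
3. Input image height $H$;
4. Input image width $W$;
5. Number of output channels $O$;
6. Kernel size $K$.

Note that the example cell below calls the function as `make_dwsp_conv2d_gpu_scheduler(B, C, H, W, K)`, without a separate $O$ argument, and uses a kernel of shape (C, 1, K, K), i.e. $O = C$.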

You should return both the TVM scheduler and the TVM operators for:
1. Input tensor $x$ with size (B, C, H, W)
2. Input kernel weight $y$ with size (O, 1, K, K)
3. Output $out$ with size (B, O, H, W)

It should be possible to use the scheduler to build a function with signature $func(x, y, out)$.
See the following cells for usage.

In [7]:
import tvm
import torch.nn.functional as F
import numpy as np
import sys
# Adding assignment 3 to the system path
# Make sure this matches your git directory
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_dwsp_conv2d_gpu_scheduler

B = 3
C = 4
H = 16
W = 32
K = 7
dtype = 'float32'
a_np = np.random.rand(B, C, H, W).astype(dtype)
w_np = np.random.rand(C, 1, K, K).astype(dtype)

s, inp, ker, out = make_dwsp_conv2d_gpu_scheduler(B, C, H, W, K)
func = tvm.build(s, [inp, ker, out], "cuda")

dev = tvm.cuda(0)
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((B, C, H, W), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

# print("Output:", b)
print(f"2DConv TVM: {evaluator(a, w, b).mean * 1e3:f} ms")

2DConv TVM: 0.023808 ms
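The timing above says nothing about correctness, so it can help to compare the GPU result against a simple CPU reference. The sketch below is one possible NumPy reference, not the assignment's test code: the function name `dwsp_conv2d_reference` and the `x_demo`/`w_demo` arrays are hypothetical, and it assumes the deep-learning convention of cross-correlation (no kernel flip) with one filter per channel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dwsp_conv2d_reference(x, w):
    """Depthwise 2D convolution reference (hypothetical helper).

    x: (B, C, H, W), w: (C, 1, K, K) with odd K; returns (B, C, H, W).
    Assumes zero padding, unit stride, and no kernel flip.
    """
    K = w.shape[-1]
    p = (K - 1) // 2
    # Zero-pad only the two spatial axes.
    x_pad = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)))
    # windows has shape (B, C, H, W, K, K): one K x K patch per output pixel.
    windows = sliding_window_view(x_pad, (K, K), axis=(2, 3))
    # Each channel c is filtered only by its own kernel w[c, 0].
    return np.einsum("bchwij,cij->bchw", windows, w[:, 0])

rng = np.random.default_rng(0)
x_demo = rng.random((3, 4, 16, 32), dtype=np.float32)
w_demo = rng.random((4, 1, 7, 7), dtype=np.float32)
ref = dwsp_conv2d_reference(x_demo, w_demo)
print(ref.shape)  # (3, 4, 16, 32)

# In the notebook, a comparison could look like this (assuming b, a_np, w_np
# from the cell above and that the NDArray exposes .numpy()):
# np.testing.assert_allclose(b.numpy(), dwsp_conv2d_reference(a_np, w_np), rtol=1e-4)
```

The vectorized window view keeps the reference short. An explicit loop over output pixels would compute the same values, only more slowly.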


In [8]:
print(tvm.lower(s, [inp, ker, out], simple_mode=True))

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((3, 4, 16, 32), "float32"), W: T.Buffer((4, 1, 7, 7), "float32"), out: T.Buffer((3, 4, 16, 32), "float32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        blockIdx_z = T.launch_thread("blockIdx.z", 3)
        blockIdx_y = T.launch_thread("blockIdx.y", 1)
        blockIdx_x = T.launch_thread("blockIdx.x", 2)
        threadIdx_z = T.launch_thread("threadIdx.z", 4)
        threadIdx_y = T.launch_thread("threadIdx.y", 16)
        threadIdx_x = T.launch_thread("threadIdx.x", 16)
        out_1 = T.Buffer((6144,), data=out.data)
        out_1[blockIdx_z * 2048 + threadIdx_z * 512 + threadIdx_y * 32 + blockIdx_x * 16 + threadIdx_x] = T.float32(0)
        for r_h, r_w in T.grid(7, 7):
            A_1 = T.Buffer((6144,), data=A.data)
            W_1 = T.Buffer((196,), data=W.data)
            out_1[blockIdx_
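The visible part of the lowered IR already shows how threads map to output elements: the flat index is `blockIdx_z * 2048 + threadIdx_z * 512 + threadIdx_y * 32 + blockIdx_x * 16 + threadIdx_x`. The sketch below replays that index expression in plain Python to check that every one of the 6144 output elements is written by exactly one thread. Reading `blockIdx.z` as the batch index, `threadIdx.z` as the channel, `threadIdx.y` as the row, and `blockIdx.x * 16 + threadIdx.x` as the column is an interpretation inferred from the strides, not something stated in the output.

```python
import numpy as np

B, C, H, W = 3, 4, 16, 32  # shapes taken from the lowered IR above
counts = np.zeros(B * C * H * W, dtype=np.int64)

# Launch extents copied from the T.launch_thread calls (blockIdx.y has extent 1).
for block_z in range(3):
    for block_x in range(2):
        for thread_z in range(4):
            for thread_y in range(16):
                for thread_x in range(16):
                    flat = (block_z * 2048 + thread_z * 512
                            + thread_y * 32 + block_x * 16 + thread_x)
                    counts[flat] += 1
                    # Recover (b, c, h, w) from the flat index and compare with
                    # the assumed meaning of each thread/block axis.
                    b, c, h, w = np.unravel_index(flat, (B, C, H, W))
                    assert (b, c, h, w) == (block_z, thread_z, thread_y,
                                            block_x * 16 + thread_x)

print(counts.min(), counts.max())  # 1 1
```

Each block therefore covers a 4 x 16 x 16 tile (channels x rows x columns) of one image, and two blocks along `blockIdx.x` cover the 32 columns.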

In [9]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_dwsp_2dconv_gpu.py

/content/gdrive/MyDrive/ece5545/a3-kpan02
platform linux -- Python 3.11.12, pytest-8.3.5, pluggy-1.5.0
rootdir: /content/gdrive/MyDrive/ece5545/a3-kpan02
plugins: typeguard-4.4.2, anyio-4.9.0, langsmith-0.3.24
collected 1357 items

tests/test_dwsp_2dconv_gpu.py .......................................... [  3%]
[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[