## CSC14120 – PARALLEL PROGRAMMING - Final project report

### Authors
Student 1: Huynh Minh Tuan - 20120024  
Student 2: Huynh Minh Tu - 20120393

### Assignment
Huynh Minh Tu:
- Set up and reorganize the dnn project using Bazel instead of CMake.
- Implement the basic parallel version of the convolutional layer.
- Implement the tiled shared-memory parallel version of the convolutional layer.
- Train the modified LeNet-5 model, set up inference and test-set evaluation.
- Write the report.

Huynh Minh Tuan:
- Set up third-party Eigen in Bazel.
- Set up CUDA compilation in Bazel.
- Improve and support the tiled shared-memory parallel version of the convolutional layer.
- Implement the batch-parallel version of the convolutional layer.
- Set up run configs for inference and the report.
- Write the report.

### Setup
In this project, we build the whole codebase with [Bazel](https://bazel.build/) instead of `CMake`. `Bazel` keeps the build configuration simple and supports CUDA as well.

- See [here](https://bazel.build/install) for how to install Bazel on each OS.
- See [here](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) for how to install CUDA.

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [2]:
!nvidia-smi

Mon Dec 25 14:45:49 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4060 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   45C    P8               2W /  60W |     59MiB /  8188MiB |     20%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [1]:
!bazel --version

bazel 6.4.0


### Overview
#### Problem statement
The goal is to implement and optimize the forward pass of the convolutional layers in a modified LeNet-5 using C++ and CUDA.

The figure below shows an example of the processing pipeline of a CNN model.  


<img src="images/cnn_illustration.png" width="600" align="center"/>

Within the scope of this project, given the [Fashion-MNIST dataset](https://github.com/zalandoresearch/fashion-mnist), we implement and train a modified LeNet-5 model in C++ that predicts the fashion category of an input image.


#### How can GPU help to speed up the process
Matrix processing is at the core of a computer vision network. A sequential implementation processes the pixels of a matrix one at a time.

However, each output pixel can be computed independently of the others. We can therefore use the GPU to process many pixels of a matrix simultaneously.

In this project, we focus on using CUDA to speed up the convolutional layers of the modified LeNet-5.

### Convolutional layer in CNN
To understand how our optimizations work, let's first look at the convolutional layer in a CNN.

The figure below shows the process of a convolutional layer.

- A single input matrix has shape `channel_in * width_in * height_in`.  
- The kernel matrix has shape `channel_in * kernel_width * kernel_height * channel_out`.  
- A single output matrix has shape `channel_out * width_out * height_out`.

<img src="images/conv.png" width="800" align="center"/>
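The relation between these shapes can be written down as a small helper. This is a minimal sketch, not the project's code: the names `ConvShape`, `conv_out_extent` and `conv_out_shape` are hypothetical, and stride 1 with no padding is an assumption, since the report does not state either.

```cpp
#include <cassert>
#include <cstddef>

// Shape of one input or output matrix, in the report's notation.
struct ConvShape {
    std::size_t channels;
    std::size_t width;
    std::size_t height;
};

// Output extent along one axis.
// ASSUMPTION: the defaults (stride 1, padding 0) are not stated in the report.
inline std::size_t conv_out_extent(std::size_t in, std::size_t kernel,
                                   std::size_t stride = 1, std::size_t pad = 0) {
    assert(stride > 0 && in + 2 * pad >= kernel);
    return (in + 2 * pad - kernel) / stride + 1;
}

// Output shape for a kernel of shape
// channel_in * kernel_width * kernel_height * channel_out.
inline ConvShape conv_out_shape(const ConvShape& in, std::size_t kernel_width,
                                std::size_t kernel_height, std::size_t channel_out) {
    return ConvShape{channel_out,
                     conv_out_extent(in.width, kernel_width),
                     conv_out_extent(in.height, kernel_height)};
}
```

For example, a 28 x 28 single-channel Fashion-MNIST image convolved with a 5 x 5 kernel and 6 output channels (illustrative values) would give a `6 * 24 * 24` output under these assumptions.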

#### 1. Sequential implementation

In the sequential implementation, the provided convolutional layer code follows these steps:
- Represent the kernel as a vector.
- Iterate over each pixel of the output matrix.
- For each output pixel, gather the corresponding pixels of the input matrix and arrange them in the same order as the kernel vector.
- Take the dot product of the two vectors to obtain the output pixel.
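The steps above can be sketched for a single sample as follows. This is an illustrative sketch rather than the provided code: `conv_forward_seq` is a hypothetical name, the memory layouts are assumptions chosen for readability (the project stores data in Eigen matrices), and stride 1 with no padding is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ASSUMED layouts (row-major), not the project's Eigen layout:
//   input  : [channel_in][height_in][width_in]
//   kernel : [channel_out][channel_in][kernel_h][kernel_w]
//   output : [channel_out][height_out][width_out]
std::vector<float> conv_forward_seq(const std::vector<float>& input,
                                    std::size_t c_in, std::size_t h_in, std::size_t w_in,
                                    const std::vector<float>& kernel,
                                    std::size_t c_out, std::size_t k_h, std::size_t k_w) {
    assert(h_in >= k_h && w_in >= k_w);
    assert(input.size() == c_in * h_in * w_in);
    assert(kernel.size() == c_out * c_in * k_h * k_w);

    const std::size_t h_out = h_in - k_h + 1;
    const std::size_t w_out = w_in - k_w + 1;
    const std::size_t patch_len = c_in * k_h * k_w;

    std::vector<float> output(c_out * h_out * w_out, 0.0f);
    std::vector<float> patch(patch_len);

    // Iterate over every output pixel.
    for (std::size_t row = 0; row < h_out; ++row) {
        for (std::size_t col = 0; col < w_out; ++col) {
            // Gather the input pixels in the same order as the kernel vector.
            std::size_t p = 0;
            for (std::size_t c = 0; c < c_in; ++c)
                for (std::size_t i = 0; i < k_h; ++i)
                    for (std::size_t j = 0; j < k_w; ++j)
                        patch[p++] = input[(c * h_in + row + i) * w_in + col + j];

            // Dot product with the kernel vector of each output channel.
            for (std::size_t oc = 0; oc < c_out; ++oc) {
                const float* w = kernel.data() + oc * patch_len;
                float acc = 0.0f;
                for (std::size_t q = 0; q < patch_len; ++q) acc += patch[q] * w[q];
                output[(oc * h_out + row) * w_out + col] = acc;
            }
        }
    }
    return output;
}
```

Every output pixel is written exactly once and reads only from `input` and `kernel`, which is what makes the loop over `(row, col)` a candidate for parallelization.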

#### 2. Parallel implementation

We identify two parts of the computation that can be parallelized:
- Each pixel of the output matrix can be computed independently.
- To boost performance, a CNN usually processes multiple samples at once. We therefore follow the same idea and process a whole batch of samples at once.
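One way to picture both kinds of independence is to treat every output pixel of every sample as a separate work item. The sketch below decodes a flat work-item index into output coordinates; `OutputCoord` and `decode_work_item` are hypothetical names, and the index order is an assumption, not necessarily the mapping used by our kernels.

```cpp
#include <cassert>
#include <cstddef>

// Coordinates of one output pixel within a batch.
struct OutputCoord {
    std::size_t sample;
    std::size_t channel;
    std::size_t row;
    std::size_t col;
};

// Total number of independent work items in a batch.
inline std::size_t total_work_items(std::size_t n_samples, std::size_t channel_out,
                                    std::size_t height_out, std::size_t width_out) {
    return n_samples * channel_out * height_out * width_out;
}

// Decode a flat index, ASSUMING the order [sample][channel][row][col].
inline OutputCoord decode_work_item(std::size_t idx, std::size_t n_samples,
                                    std::size_t channel_out,
                                    std::size_t height_out, std::size_t width_out) {
    assert(idx < total_work_items(n_samples, channel_out, height_out, width_out));
    OutputCoord c{};
    c.col = idx % width_out;
    idx /= width_out;
    c.row = idx % height_out;
    idx /= height_out;
    c.channel = idx % channel_out;
    idx /= channel_out;
    c.sample = idx;
    return c;
}
```

Because no work item depends on the result of another, they can be distributed over GPU threads in any order.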

### Parallel optimization
#### 1. Illustration
Below is the workflow of our final optimization (the fastest version).

The workflow is:
- A `blockSize` is defined. `blockSize` has 3 dimensions, and we utilize all of them.
- The first and second dimensions (`blockSize.x`, `blockSize.y`) parallelize the work within a single input matrix. For each output tile, we copy the required input tile to SMEM and apply the kernel filter on SMEM instead of GMEM.
- Following the batch-processing idea used by most AI frameworks (PyTorch, TensorFlow), the third dimension (`blockSize.z`) covers the `n_samples` samples so that they are processed simultaneously. Dividing the batch across CUDA streams and using `cudaEvent` further improves performance.
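The tiling step can be emulated on the host to show which input pixels each output tile needs. This is a simplified single-channel sketch under the assumption of stride 1 and no padding: `input_tile_extent` and `conv2d_tiled_host` are hypothetical names, and the local `tile` buffer only stands in for SMEM. It is not the CUDA kernel itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Edge of the input tile needed for a tile_out-wide output tile:
// the tile itself plus a halo of (k - 1) pixels.
inline std::size_t input_tile_extent(std::size_t tile_out, std::size_t k) {
    assert(tile_out > 0 && k > 0);
    return tile_out + k - 1;
}

// Host emulation of tiled convolution for one channel (row-major data).
std::vector<float> conv2d_tiled_host(const std::vector<float>& input,
                                     std::size_t h_in, std::size_t w_in,
                                     const std::vector<float>& kernel,
                                     std::size_t k, std::size_t tile_out) {
    assert(h_in >= k && w_in >= k);
    assert(input.size() == h_in * w_in && kernel.size() == k * k);

    const std::size_t h_out = h_in - k + 1;
    const std::size_t w_out = w_in - k + 1;
    const std::size_t tile_in = input_tile_extent(tile_out, k);

    std::vector<float> output(h_out * w_out, 0.0f);
    std::vector<float> tile(tile_in * tile_in);  // stands in for SMEM

    for (std::size_t r0 = 0; r0 < h_out; r0 += tile_out) {
        for (std::size_t c0 = 0; c0 < w_out; c0 += tile_out) {
            // Border tiles may be smaller than tile_out.
            const std::size_t rows = std::min(tile_out, h_out - r0);
            const std::size_t cols = std::min(tile_out, w_out - c0);

            // Load phase: copy the input tile (with halo) into the buffer.
            for (std::size_t r = 0; r < rows + k - 1; ++r)
                for (std::size_t c = 0; c < cols + k - 1; ++c)
                    tile[r * tile_in + c] = input[(r0 + r) * w_in + (c0 + c)];

            // Compute phase: read only from the buffer.
            for (std::size_t r = 0; r < rows; ++r) {
                for (std::size_t c = 0; c < cols; ++c) {
                    float acc = 0.0f;
                    for (std::size_t i = 0; i < k; ++i)
                        for (std::size_t j = 0; j < k; ++j)
                            acc += tile[(r + i) * tile_in + (c + j)] * kernel[i * k + j];
                    output[(r0 + r) * w_out + (c0 + c)] = acc;
                }
            }
        }
    }
    return output;
}
```

In a CUDA kernel the load phase and the compute phase would be separated by a block-level synchronization, and each input pixel of the tile would be fetched from GMEM once instead of up to `k * k` times.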

<img src="images/optimize_conv.png" width="1000" align="center"/>

#### 2. Versions
To show the impact of each optimization clearly, we implemented 3 versions:
- Version 1: A simple convolution implementation that uses CUDA parallel processing.
- Version 2: Tiled shared-memory convolution for each input matrix.
- Version 3: An upgrade of version 2 that adds CUDA streams to process a batch of samples simultaneously.

The figure above illustrates the workflow of version 3, which is the fastest version.

### Evaluate
- To verify that our convolution implementation returns correct values, we trained the model on the given Fashion-MNIST data. The accuracy on the test data is around 0.82.
- For benchmarking, we run each version of the CUDA convolution on the test dataset. Note that only the elapsed times of layers 1 and 4, which are the convolutional layers, are relevant.

In [21]:
### Version 0 (host version)
!bazel run --noshow_progress //:inference --config=report

(15:23:13) INFO: Current date is 2023-12-25
(15:23:13) INFO: Build option --cxxopt has changed, discarding analysis cache.
(15:23:13) INFO: Analyzed target //:inference (0 packages loaded, 2367 targets configured).
(15:23:13) INFO: Found 1 target...
Target //:inference up-to-date:
  bazel-bin/inference
(15:23:14) INFO: Elapsed time: 1.813s, Critical Path: 1.70s
(15:23:14) INFO: 15 processes: 1 internal, 14 local.
(15:23:14) INFO: Running command line: bazel-bin/inference
Object loaded from binary file: weights/lenet5_mnist_weight
--------------------------------
|  Network | Elapsed Time (ms) |
--------------------------------
| Layer 1  |             13690 |
| Layer 2  |               178 |
| Layer 3  |              4288 |
| Layer 4  |             10399 |
| Layer 5  |                52 |
| Layer 6  |              1308 |
| Layer 7  |               613 |
| Layer 8  |                 5 |
| Layer 9  |               216 |
|

In [18]:
### Version 1
!bazel run --noshow_progress //:inference --config=cuda --config=report --//:conv_ver=v1

(15:22:32) INFO: Current date is 2023-12-25
(15:22:32) INFO: Analyzed target //:inference (0 packages loaded, 0 targets configured).
(15:22:32) INFO: Found 1 target...
Target //:inference up-to-date:
  bazel-bin/inference
(15:22:32) INFO: Elapsed time: 0.053s, Critical Path: 0.00s
(15:22:32) INFO: 1 process: 1 internal.
(15:22:32) INFO: Running command line: bazel-bin/inference
Object loaded from binary file: weights/lenet5_mnist_weight
--------------------------------
|  Network | Elapsed Time (ms) |
--------------------------------
| Layer 1  |               519 |
| Layer 2  |               182 |
| Layer 3  |              4324 |
| Layer 4  |              1398 |
| Layer 5  |                54 |
| Layer 6  |              1315 |
| Layer 7  |               611 |
| Layer 8  |                 6 |
| Layer 9  |               216 |
| Layer 10 |                 3 |
| Layer 11 |                21 |
| Layer 12 |                10 |
------

In [19]:
### Version 2
!bazel run --noshow_progress //:inference --config=cuda  --config=report --//:conv_ver=v2

(15:22:44) INFO: Current date is 2023-12-25
(15:22:44) INFO: Build option --//:conv_ver has changed, discarding analysis cache.
(15:22:44) INFO: Analyzed target //:inference (0 packages loaded, 2367 targets configured).
(15:22:44) INFO: Found 1 target...
Target //:inference up-to-date:
  bazel-bin/inference
(15:22:46) INFO: Elapsed time: 1.837s, Critical Path: 1.71s
(15:22:46) INFO: 11 processes: 1 internal, 10 local.
(15:22:46) INFO: Running command line: bazel-bin/inference
Object loaded from binary file: weights/lenet5_mnist_weight
--------------------------------
|  Network | Elapsed Time (ms) |
--------------------------------
| Layer 1  |               396 |
| Layer 2  |               183 |
| Layer 3  |              4327 |
| Layer 4  |              1072 |
| Layer 5  |                54 |
| Layer 6  |              1313 |
| Layer 7  |               612 |
| Layer 8  |                 6 |
| Layer 9  |               21

In [20]:
### Version 3
!bazel run --noshow_progress //:inference --config=cuda --config=report --//:conv_ver=v3

(15:22:57) INFO: Current date is 2023-12-25
(15:22:57) INFO: Build option --//:conv_ver has changed, discarding analysis cache.
(15:22:57) INFO: Analyzed target //:inference (0 packages loaded, 2367 targets configured).
(15:22:57) INFO: Found 1 target...
Target //:inference up-to-date:
  bazel-bin/inference
(15:22:59) INFO: Elapsed time: 1.746s, Critical Path: 1.62s
(15:22:59) INFO: 11 processes: 1 internal, 10 local.
(15:22:59) INFO: Running command line: bazel-bin/inference
Object loaded from binary file: weights/lenet5_mnist_weight
--------------------------------
|  Network | Elapsed Time (ms) |
--------------------------------
| Layer 1  |                56 |
| Layer 2  |               179 |
| Layer 3  |              4286 |
| Layer 4  |                28 |
| Layer 5  |                52 |
| Layer 6  |              1305 |
| Layer 7  |               613 |
| Layer 8  |                 6 |
| Layer 9  |               21

All elapsed times are measured in milliseconds.

| Version       | Layer 1          | Layer 4        |
| ------------- | ---------------- | -------------- |
| 0 (host)      | 13690            | 10399          |
| 1             | 519              | 1398           |
| 2             | 396              | 1072           |
| 3             | 56               | 28             |


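The results show that every CUDA version is faster than the host version, and that version 3 brings the largest improvement: Layer 1 drops from 13690 ms to 56 ms (about 244x faster) and Layer 4 from 10399 ms to 28 ms (about 371x faster).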
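The speedup of each version over the host version follows directly from the table. The helper below makes the arithmetic explicit; `speedup` and the two arrays are hypothetical names introduced only for illustration, and the values are copied from the table.

```cpp
#include <cassert>

// Elapsed times in ms, indexed by version (0 = host), copied from the table.
constexpr double kLayer1Ms[4] = {13690.0, 519.0, 396.0, 56.0};
constexpr double kLayer4Ms[4] = {10399.0, 1398.0, 1072.0, 28.0};

// Speedup of an optimized run relative to a baseline run.
inline double speedup(double baseline_ms, double optimized_ms) {
    assert(baseline_ms > 0.0 && optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}
```

For instance, `speedup(kLayer1Ms[0], kLayer1Ms[3])` gives the Layer 1 speedup of version 3 over the host version.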

### Reflection
#### Each member
- Huynh Minh Tu:
    - Difficulties:
        - Setting up and reorganizing the whole project with Bazel instead of CMake.
        - The way Eigen matrices work and are laid out in memory made the CUDA version difficult to implement.
    - Lessons learned:
        - Compiling C++ with Bazel.
        - Building a CUDA-based C++ object within a project.

- Huynh Minh Tuan:
    - Difficulties:
        - Setting up third-party Eigen in Bazel.
    - Lessons learned:
        - Setting up CUDA with Bazel.
        - Implementing batch processing of data samples with C++ and CUDA.

#### Further plans
- We have not yet tried using atomic add along the channel dimension of the matrix, because `blockSize` is limited to 3 dimensions. We believe there may still be a way to do it.

### YouTube video

[video link](https://youtu.be/MWuiGVVIjVw)