<a href="https://colab.research.google.com/github/dlsys10714/notebooks/blob/main/13_hardware_acceleration_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 13: Hardware Acceleration Implementation 

In this lecture, we will walk through the backend scaffolding that provides hardware acceleration for needle.




## Select a GPU runtime type
In this lecture, we are going to use C++ and CUDA to build accelerated linear algebra libraries. To do so, please make sure you select a runtime type with a GPU and rerun the cells if needed:
- Click on the "Runtime" tab
- Click "Change runtime type"
- Select GPU

After you have started the right runtime, you can run the following command to check whether a GPU is available.

In [1]:
!nvidia-smi

Wed Oct 13 15:15:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
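
The same check can also be done from Python, which is handy if a script should fall back to the CPU when no GPU is present. The following is a minimal sketch; the helper name `gpu_available` is hypothetical, and it only assumes that `nvidia-smi -L` lists one line per detected GPU.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` exists and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA driver tools found on this machine
    try:
        # `nvidia-smi -L` prints one line per GPU, starting with "GPU <index>:".
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU " in result.stdout


print("GPU available:", gpu_available())
```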

## Prepare the codebase

To get started, we can clone the lecture13 repo from GitHub.

In [2]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p 10714
%cd /content/drive/MyDrive/10714
# comment out the following line if you run it for the second time
# as you already have a local copy of lecture13
!git clone https://github.com/dlsys10714/lecture13 
!ln -s /content/drive/MyDrive/10714/lecture13 /content/needle

Mounted at /content/drive
/content/drive/MyDrive
/content/drive/MyDrive/10714
fatal: destination path 'lecture13' already exists and is not an empty directory.


In [3]:
!python3 -m pip install pybind11

Collecting pybind11
  Downloading pybind11-2.8.0-py2.py3-none-any.whl (207 kB)
[?25l[K     |█▋                              | 10 kB 26.4 MB/s eta 0:00:01[K     |███▏                            | 20 kB 19.8 MB/s eta 0:00:01[K     |████▊                           | 30 kB 10.5 MB/s eta 0:00:01[K     |██████▎                         | 40 kB 8.6 MB/s eta 0:00:01[K     |███████▉                        | 51 kB 4.2 MB/s eta 0:00:01[K     |█████████▌                      | 61 kB 4.6 MB/s eta 0:00:01[K     |███████████                     | 71 kB 4.1 MB/s eta 0:00:01[K     |████████████▋                   | 81 kB 4.6 MB/s eta 0:00:01[K     |██████████████▏                 | 92 kB 4.7 MB/s eta 0:00:01[K     |███████████████▊                | 102 kB 4.1 MB/s eta 0:00:01[K     |█████████████████▍              | 112 kB 4.1 MB/s eta 0:00:01[K     |███████████████████             | 122 kB 4.1 MB/s eta 0:00:01[K     |████████████████████▌           | 133 kB 4.1 MB/s eta 0:00:

### Build the needle cuda library

We use pybind11 to build a C++/CUDA library for acceleration. Run `make` to build the library.

In [4]:
%cd /content/needle
!make

/content/drive/MyDrive/10714/lecture13
-- Found pybind11: /usr/local/lib/python3.7/dist-packages/pybind11/include (found version "2.8.0" )
-- Find cuda, build with cuda support
-- Autodetected CUDA architecture(s):  3.7
-- Configuring done
-- Generating done
-- Build files have been written to: /content/drive/MyDrive/10714/lecture13/build
make[1]: Entering directory '/content/drive/MyDrive/10714/lecture13/build'
make[2]: Entering directory '/content/drive/MyDrive/10714/lecture13/build'
make[3]: Entering directory '/content/drive/MyDrive/10714/lecture13/build'
[35m[1mScanning dependencies of target main[0m
make[3]: Leaving directory '/content/drive/MyDrive/10714/lecture13/build'
make[3]: Entering directory '/content/drive/MyDrive/10714/lecture13/build'
[-14%] Building CXX object CMakeFiles/main.dir/python/pybind/main.cc.o
[  0%] Linking CXX shared module ../python/needle/_ffi/main.cpython-37m-x86_64-linux-gnu.so
make[3]: Leaving directory '/content/drive/MyDrive

We can then run the following commands to make the package path available both in Colab's Python environment and in PYTHONPATH.

In [None]:
%set_env PYTHONPATH /content/needle/python:/env/python
import sys
sys.path.append("/content/needle/python")

## Codebase walkthrough


Now click the files panel on the left side. You should be able to see these new files:

- needle/include/needle
    - cuda_ops.h
    - device_api.h
    - dlpack.h
    - logging.h
    - ndarray.h
- needle/src/
    - cpu_device_api.cc
    - cuda_device_api.cc
    - device_api.cc
    - device_api_internal.h
    - ndarray.cc
- needle/python/pybind
    - main.cc
- needle/python
    - backend_ndarray.py
    - cuda_backend.py

Our framework is called needle, which stands for necessary elements of deep learning. You can also view it as a sewing needle that threads through cloth to form (neural) net patterns and creates traces for automatic differentiation.


## Creating a CUDA Tensor






In [None]:
import needle as ndl

We can create a CUDA tensor from the data by specifying a device keyword.

In [None]:
x = ndl.Tensor([1, 2, 3], dtype="float32", device=ndl.cuda())

In [None]:
x

needle.Tensor([1. 2. 3.])

In [None]:
x.device

cuda(0)

In [None]:
y = x + 1

In [None]:
y.device

cuda(0)

In [None]:
y.numpy()

array([2., 3., 4.], dtype=float32)

### Key Data Structures

C++ side:
- NDArray: exposes an n-dimensional array data structure

Python side:
- backend_ndarray.NDArray: wraps the C++ side of computation

Pybind bridge:
- pybind/main.cc

## Trace GPU execution

Now, let us take a look at what happens when we execute the following code:


In [None]:
x = ndl.Tensor([1, 2, 3], dtype="float32", device=ndl.cuda())

This call produces the following trace:
- `autograd.Tensor.__init__`
- `cuda_backend.CUDADevice.array`
- `backend_ndarray.array`
    - `backend_ndarray.empty`
    - `_ffi.empty`
    - `pybind/main.cc:empty`
    - `include/needle/ndarray.h: NDArray::Empty`
- `backend_ndarray.NDArray.copyfrom`
    - `pybind/main.cc:copyfrombytes`
    - `include/needle/ndarray.h: NDArray::CopyFromBytes`
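
The trace above follows a two-step pattern: first allocate an empty buffer, then copy the host bytes into it. Here is a minimal pure-Python sketch of that pattern. `ToyBuffer` and `toy_array` are hypothetical names that stand in for the real device allocation and for `backend_ndarray.array`; the "device" memory here is just a host `bytearray`.

```python
import array


class ToyBuffer:
    """Hypothetical stand-in for a device allocation (plain host memory here)."""

    def __init__(self, num_elements: int, itemsize: int = 4):
        # "empty": reserve the bytes; the contents are not meaningful yet.
        self.data = bytearray(num_elements * itemsize)
        self.num_elements = num_elements

    def copy_from_bytes(self, raw: bytes) -> None:
        # "copyfrom": the byte counts must match before anything is copied.
        if len(raw) != len(self.data):
            raise ValueError(f"expected {len(self.data)} bytes, got {len(raw)}")
        self.data[:] = raw

    def to_list(self) -> list:
        out = array.array("f")
        out.frombytes(bytes(self.data))
        return out.tolist()


def toy_array(values) -> ToyBuffer:
    host = array.array("f", values)             # float32 staging copy on the host
    buf = ToyBuffer(len(host), host.itemsize)   # step 1: allocate
    buf.copy_from_bytes(host.tobytes())         # step 2: fill
    return buf


print(toy_array([1, 2, 3]).to_list())
```

Separating allocation from the copy keeps the allocation path identical for every device, so only the copy routine has to know where the destination memory lives.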




In [None]:
y = x + 1

This operation produces the following trace:

- `cuda_backend.add_scalar`
- `_ffi.CUDAAddScalar`
- `include/needle/cuda_ops.h: CUDAAddScalar`
- `src/cuda_ops.cu: CUDAAddScalar`

In [None]:
y.numpy()

array([2., 3., 4.], dtype=float32)

This call produces the following trace:


## Guidelines for Reading C++/CUDA related Files

The project contains around 1000 lines of scaffolding code.
You are more than welcome to read all of it to get a full picture of the project. However, you can safely skip some files.

Free to skip: you only need to know how to use these files (e.g. type `make`), not their implementation details.

- CMakeLists.txt: sets up the build; you likely do not need to tweak it.
- include/needle/logging.h: a minimal glog-style helper that enables `LOG(INFO) << "message"` and `CHECK(condition) << "message"`. You do not need to understand the implementation as long as you know how to use these macros.

Good to read: we recommend that you read and understand these files, but you likely do not need to update them in your homework.

- device_api.h
- ndarray.h
- dlpack.h
- pybind/main.cc

Need to update in your homework: you will need to modify these files.

- cuda_ops.h
- cuda_ops.cu






## C++ NDArray Data Structure

- Open up `include/needle/ndarray.h`. NDArray contains a `shared_ptr` to a Container object.
- The Container object wraps DLTensor, which is a standard data structure for describing tensors in memory.
- The actual data allocations are defined by DeviceAPI (`device_api.h`).
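
To make the handle/container split concrete, here is a small Python sketch of the idea. The field names mirror those of DLPack's `DLTensor` (`data`, `device`, `ndim`, `dtype`, `shape`, `strides`, `byte_offset`), but the classes `ToyDLTensor` and `ToyNDArray` are hypothetical illustrations and not the code in `ndarray.h`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyDLTensor:
    """Python mirror of the fields in DLPack's DLTensor (illustration only)."""
    data: bytearray                  # stands in for the raw `void*` data pointer
    device: Tuple[str, int]          # (device_type, device_id), e.g. ("cuda", 0)
    ndim: int
    dtype: str                       # DLPack encodes this as (code, bits, lanes)
    shape: Tuple[int, ...]
    strides: Optional[Tuple[int, ...]] = None  # None means compact row-major
    byte_offset: int = 0


class ToyNDArray:
    """Hypothetical handle: several handles may share one container."""

    def __init__(self, container: ToyDLTensor):
        self._container = container  # plays the role of the shared_ptr

    @property
    def shape(self) -> Tuple[int, ...]:
        return self._container.shape

    def same_storage(self, other: "ToyNDArray") -> bool:
        return self._container is other._container


container = ToyDLTensor(bytearray(12), ("cuda", 0), 1, "float32", (3,))
a = ToyNDArray(container)
b = ToyNDArray(container)  # copying a handle does not copy the data
print(a.shape, a.same_storage(b))
```

Because the handle only holds a reference, passing an array between functions is cheap, and the underlying memory can be released once the last handle goes away.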




## CUDA Acceleration

- Now let us open `src/cuda_ops.cu` and take a look at the current implementation of the GPU ops.
- Note that all the ops take NDArrays that contain pre-allocated GPU pointers. The allocations are defined in `src/cuda_device_api.cc`.
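
As a reading aid, the sketch below emulates in plain Python the indexing pattern that elementwise CUDA kernels commonly use: each thread computes a global index from its block and thread ids and guards against running past the end of the array. This is an assumption about the general pattern, not a copy of the code in `src/cuda_ops.cu`; the function names and the default block size of 256 are hypothetical.

```python
def add_scalar_kernel(block_idx, thread_idx, block_dim, a, scalar, out, n):
    """Body executed by one (emulated) GPU thread."""
    idx = block_idx * block_dim + thread_idx  # global element index
    if idx < n:                               # guard threads past the end
        out[idx] = a[idx] + scalar


def launch_add_scalar(a, scalar, block_dim=256):
    n = len(a)
    out = [0.0] * n                              # pre-allocated output buffer
    grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) blocks
    # A real GPU runs these threads in parallel; here we loop sequentially.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            add_scalar_kernel(block_idx, thread_idx, block_dim, a, scalar, out, n)
    return out


print(launch_add_scalar([1.0, 2.0, 3.0], 1.0))
```

Note that the output buffer is created before the launch, matching the point above that ops receive pre-allocated memory rather than allocating inside the kernel.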

## Steps for adding a new operator implementation
- Add operator declaration to needle/cuda_ops.h
- Implement the cuda operator in src/cuda_ops.cu
- Expose the API to python ffi through pybind/main.cc
- Call into the API in cuda_backend.py
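
The four steps above can be sketched end to end with plain Python stand-ins for each layer. Everything here is hypothetical: `CUDAMulScalar` is an illustrative operator name modeled on `CUDAAddScalar`, `ToyFFI` replaces the compiled `_ffi` module, and the "kernel" is an ordinary Python loop.

```python
# Steps 1-2 (declaration and CUDA implementation), emulated by a Python function.
def _toy_cuda_mul_scalar(a, scalar, out):
    for i in range(len(a)):
        out[i] = a[i] * scalar


# Step 3: the binding layer exposes the function under a Python-visible name.
class ToyFFI:
    """Hypothetical stand-in for the compiled `_ffi` module."""
    CUDAMulScalar = staticmethod(_toy_cuda_mul_scalar)


_ffi = ToyFFI()


# Step 4: the backend allocates the output and calls into the FFI.
def mul_scalar(a, scalar):
    out = [0.0] * len(a)
    _ffi.CUDAMulScalar(a, scalar, out)
    return out


print(mul_scalar([1.0, 2.0, 3.0], 2.0))
```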

## Write Standalone Python Test Files

Now that needle includes additional C++/CUDA libraries, we need to run `make` to rebuild the library after each change. Additionally, because the Colab environment caches the old library, it is inconvenient to debug the updated library from IPython cells.




In [None]:
!make

-- Found pybind11: /usr/local/lib/python3.7/dist-packages/pybind11/include (found version "2.8.0" )
-- Find cuda, build with cuda support
-- Autodetected CUDA architecture(s):  3.7
-- Configuring done
-- Generating done
-- Build files have been written to: /content/drive/My Drive/10714/lecture13/build
make[1]: Entering directory '/content/drive/My Drive/10714/lecture13/build'
make[2]: Entering directory '/content/drive/My Drive/10714/lecture13/build'
make[3]: Entering directory '/content/drive/My Drive/10714/lecture13/build'
make[3]: Leaving directory '/content/drive/My Drive/10714/lecture13/build'
make[3]: Entering directory '/content/drive/My Drive/10714/lecture13/build'
[-14%] Linking CXX shared module ../python/needle/_ffi/main.cpython-37m-x86_64-linux-gnu.so
make[3]: Leaving directory '/content/drive/My Drive/10714/lecture13/build'
[ 71%] Built target main
make[2]: Leaving directory '/content/drive/My Drive/10714/lecture13/build'
make[1]: Leaving directory '/content/d


We recommend writing separate Python files and invoking them from the command line. Create a new file `tests/mytest.py` and write your local tests there. This is also a common development practice in large projects that involve a Python/C++ FFI.

In [None]:
!python tests/mytest.py

After building the library, you can fully restart the runtime (factory reset runtime) if you want to bring the updated changes back into another Colab notebook. Note that you will need to save your code changes to the drive or a private GitHub repo.