<a href="https://colab.research.google.com/github/dlsys10714/notebooks/blob/main/17_hardware_acceleration_architecture_overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 17: Hardware Acceleration Architecture Overview 

In this lecture, we will walk through the backend scaffolding that gives needle hardware acceleration.




## Select a GPU runtime type
In this lecture, we are going to use C++ and CUDA to build accelerated linear algebra libraries. To do so, please make sure you select a runtime type with a GPU and rerun the cells if needed:
- Click on the "Runtime" tab
- Click "Change runtime type"
- Select GPU

After you have started the right runtime, you can run the following command to check whether a GPU is available.

In [1]:
!nvidia-smi

Sun Oct 31 22:37:42 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Prepare the codebase

To get started, we can clone the related lecture17 repo from GitHub.

In [2]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p 10714
%cd /content/drive/MyDrive/10714
# comment out the following line if you run this cell a second time,
# since you already have a local copy of lecture17
!rm -rf lecture17
!git clone https://github.com/dlsys10714/lecture17
!ln -s /content/drive/MyDrive/10714/lecture17 /content/needle

Mounted at /content/drive
/content/drive/MyDrive
/content/drive/MyDrive/10714
Cloning into 'lecture17'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (134/134), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 134 (delta 52), reused 133 (delta 51), pack-reused 0
Receiving objects: 100% (134/134), 11.19 MiB | 9.45 MiB/s, done.
Resolving deltas: 100% (52/52), done.


In [3]:
!python3 -m pip install pybind11

Collecting pybind11
  Downloading pybind11-2.8.1-py2.py3-none-any.whl (208 kB)

### Build the needle cuda library

We use pybind11 to build a C++/CUDA library for acceleration. Run `make` to build the library.

In [4]:
%cd /content/needle
!make

/content/drive/MyDrive/10714/lecture17
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python: /usr/bin/python3.7 (found version "3.7.12") found components:  Development Interpreter 
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Found pybind11: /usr/local/lib/python3.7/dist-packages/pybind11/include (found version "2.8.1" )
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looki

We can then run the following commands to add the package path to both the Colab environment's `PYTHONPATH` and the current session's `sys.path`.

In [5]:
%set_env PYTHONPATH /content/needle/python:/env/python
import sys
sys.path.append("/content/needle/python")

env: PYTHONPATH=/content/needle/python:/env/python


## Codebase walkthrough


Now click the files panel on the left side. You should be able to see these files:

Python:
- python/needle/backend_ndarray/ndarray.py
- python/needle/backend_ndarray/ndarray_backend_numpy.py

C++/CUDA:
- src/ndarray_backend_cpu.cc
- src/ndarray_backend_cuda.cu

The main goal of this lecture is to create an accelerated ndarray library.
As a result, we do not need to deal with needle.Tensor for now and will focus on backend_ndarray's implementation. After we build the array library, we can then use it to power the array computation in needle.


## Creating a CUDA NDArray






In [6]:
from needle import backend_ndarray as nd

In [7]:
x = nd.NDArray([1, 2, 3])

In [8]:
y = x + x

In [9]:
y

NDArray([2. 4. 6.], device=numpy_device())

We can create a CUDA NDArray from the data by passing the `device` keyword.

In [10]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())

In [11]:
y = x + 1

In [12]:
x.numpy()

array([1., 2., 3.], dtype=float32)

In [13]:
x.device

cuda()

In [14]:
y = x + 1

In [15]:
y.device

cuda()

In [16]:
y.numpy()

array([2., 3., 4.], dtype=float32)

### Key Data Structures



## Trace GPU execution

Now, let us take a look at what happens when we execute the following code:


In [17]:
x = nd.NDArray([1, 2, 3])
y = x + 1

This code produces the following trace:

backend_ndarray/ndarray.py
- `NDArray.__add__`
- `NDArray.ewise_or_scalar`
- `ndarray_backend_cpu.cc:ScalarAdd`

In [18]:
y.numpy()

array([2., 3., 4.], dtype=float32)

This call produces the following trace:

- `NDArray.numpy`
- `ndarray_backend_cpu.cc:to_numpy`
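The dispatch step in the first trace can be pictured with a small pure-Python sketch. This is not needle's actual code: `ToyArray`, `ewise_add`, and `scalar_add` are hypothetical stand-ins, and the argument list of `ewise_or_scalar` is an assumption. The sketch only shows the idea of routing `x + 1` to a scalar routine and `x + x` to an elementwise routine.

```python
def ewise_add(a, b):
    """Stand-in for a backend elementwise-add routine."""
    return [x + y for x, y in zip(a, b)]


def scalar_add(a, s):
    """Stand-in for a backend scalar-add routine."""
    return [x + s for x in a]


class ToyArray:
    """Toy flat array used only for illustration (hypothetical)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def ewise_or_scalar(self, other, ewise_func, scalar_func):
        # An array operand goes to the elementwise routine.
        if isinstance(other, ToyArray):
            assert len(self.values) == len(other.values), "shape mismatch"
            return ToyArray(ewise_func(self.values, other.values))
        # Any other operand is treated as a scalar.
        return ToyArray(scalar_func(self.values, float(other)))

    def __add__(self, other):
        return self.ewise_or_scalar(other, ewise_add, scalar_add)


x = ToyArray([1, 2, 3])
plus_one = x + 1   # routed to scalar_add
doubled = x + x    # routed to ewise_add
```

In the real library, the two routines would be functions exported by the compiled backend module rather than Python list comprehensions.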

## Guidelines for Reading C++/CUDA related Files

Read
- src/ndarray_backend_cpu.cc
- src/ndarray_backend_cuda.cu


Optional:
- CMakeLists.txt: this sets up the build; you likely will not need to change it.







## NDArray Data Structure

Open up `python/needle/backend_ndarray/ndarray.py`.

An NDArray contains the following fields:
- handle: The backend handle to a flat array that stores the data.
- shape: The shape of the NDArray.
- strides: The strides, which describe how a multi-dimensional index maps to a position in the flat array.
- offset: The offset of the first element within the flat array.
- device: The backend device that backs the computation
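To make the roles of `shape`, `strides`, and `offset` concrete, here is a minimal pure-Python sketch of strided indexing over a flat list. The helper names `compact_strides` and `flat_index` are illustrative and are not taken from `ndarray.py`; strides are assumed to be counted in elements rather than bytes.

```python
def compact_strides(shape):
    """Row-major strides, in elements, for a densely packed array."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_index(index, strides, offset=0):
    """Position of a multi-dimensional index in the flat array."""
    return offset + sum(i * s for i, s in zip(index, strides))


# A 2 x 3 array stored as a flat list of six elements.
data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
shape = (2, 3)
strides = compact_strides(shape)

# Element [1, 2] lives at position 1 * strides[0] + 2 * strides[1].
element = data[flat_index((1, 2), strides)]

# A transposed view swaps shape and strides; the flat data is untouched.
t_shape, t_strides = (3, 2), (strides[1], strides[0])
transposed = [
    [data[flat_index((i, j), t_strides)] for j in range(t_shape[1])]
    for i in range(t_shape[0])
]

# A view of the second row reuses the same data with a nonzero offset.
second_row = [data[flat_index((j,), (1,), offset=3)] for j in range(3)]
```

Because views such as the transpose and the row slice only change `shape`, `strides`, and `offset`, they can share one flat buffer without copying any data.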






## CUDA Acceleration

Now let us open `src/ndarray_backend_cuda.cu` and take a look at the current implementation of the GPU ops.


## Steps for adding a new operator implementation
- Add an implementation in `ndarray_backend_cuda.cu` and expose it via pybind11
- Call the operator from `ndarray.py`
- Write test cases
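For the last step, one common pattern is to compare the new operator against a simple reference implementation. The sketch below uses only the standard library so that it runs without a GPU; `check_scalar_add`, `reference_scalar_add`, and `stand_in_op` are hypothetical names, and the stand-in would be replaced by a call into the built library.

```python
import math


def assert_allclose(actual, expected, rel_tol=1e-5, abs_tol=1e-6):
    """Fail if two flat sequences differ in length or in any element."""
    actual, expected = list(actual), list(expected)
    assert len(actual) == len(expected), "length mismatch"
    for got, want in zip(actual, expected):
        assert math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol), (got, want)


def reference_scalar_add(values, scalar):
    """Plain-Python reference result for adding a scalar to each element."""
    return [float(v) + scalar for v in values]


def check_scalar_add(op, values, scalar):
    """Compare the operator under test with the reference."""
    assert_allclose(op(values, scalar), reference_scalar_add(values, scalar))


def stand_in_op(values, scalar):
    # Placeholder so the sketch runs anywhere. With the library built,
    # the operator under test could instead be something like:
    #   lambda v, s: (nd.NDArray(v, device=nd.cuda()) + s).numpy()
    return [float(v) + scalar for v in values]


check_scalar_add(stand_in_op, [1, 2, 3], 1)
```

Comparing with a tolerance rather than exact equality is a deliberate choice here, since the backend computes in float32 (as the `dtype=float32` outputs above show).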

In [19]:
!make

-- Found pybind11: /usr/local/lib/python3.7/dist-packages/pybind11/include (found version "2.8.1" )
-- Found cuda, building cuda backend
-- Autodetected CUDA architecture(s):  3.7
-- Configuring done
-- Generating done
-- Build files have been written to: /content/drive/MyDrive/10714/lecture17/build
make[1]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
make[2]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
make[3]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
make[3]: Leaving directory '/content/drive/MyDrive/10714/lecture17/build'
make[3]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
[-25%] Linking CXX shared module ../python/needle/backend_ndarray/ndarray_backend_cuda.cpython-37m-x86_64-linux-gnu.so
make[3]: Leaving directory '/content/drive/MyDrive/10714/lecture17/build'
[  0%] Built target ndarray_backend_cuda
make[3]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
mak

In [20]:
import needle as ndl
x = ndl.Tensor([1,2,3], device=ndl.cuda(), dtype="float32")
y = ndl.Tensor([2,3,5], device=ndl.cuda(), dtype="float32")
x + y


In [21]:
!nvprof python tests/test_backend_ndarray.py

(1, 2, 3)
==884== NVPROF is profiling process 884, command: python3 tests/test_backend_ndarray.py
==884== Profiling application: python3 tests/test_backend_ndarray.py
==884== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.53%  4.9280us         2  2.4640us  2.3680us  2.5600us  [CUDA memcpy DtoH]
                   30.25%  3.1360us         1  3.1360us  3.1360us  3.1360us  needle::cuda::ScalarAddKernel(float const *, float, float*, unsigned long)
                   22.22%  2.3040us         1  2.3040us  2.3040us  2.3040us  [CUDA memcpy HtoD]
      API calls:   99.61%  248.01ms         2  124.00ms  8.5630us  248.00ms  cudaMalloc
                    0.19%  473.07us         1  473.07us  473.07us  473.07us  cuDeviceTotalMem
                    0.09%  223.82us       101  2.2150us     160ns  102.07us  cuDeviceGetAttribute
                    0.06%  138.27us         2  69.133us  11.803us  126.46us  cudaFree
               

## Write Standalone Python Test Files

Now that needle includes additional C++/CUDA libraries, we need to run `make` to rebuild them after each change. In addition, because the Colab environment caches the old library, IPython cells are inconvenient for debugging the updated library.




In [22]:
!make

-- Found pybind11: /usr/local/lib/python3.7/dist-packages/pybind11/include (found version "2.8.1" )
-- Found cuda, building cuda backend
-- Autodetected CUDA architecture(s):  3.7
-- Configuring done
-- Generating done
-- Build files have been written to: /content/drive/MyDrive/10714/lecture17/build
make[1]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
make[2]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
make[3]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
make[3]: Leaving directory '/content/drive/MyDrive/10714/lecture17/build'
[  0%] Built target ndarray_backend_cuda
make[3]: Entering directory '/content/drive/MyDrive/10714/lecture17/build'
make[3]: Leaving directory '/content/drive/MyDrive/10714/lecture17/build'
[ 50%] Built target ndarray_backend_cpu
make[2]: Leaving directory '/content/drive/MyDrive/10714/lecture17/build'
make[1]: Leaving directory '/content/drive/MyDrive/10714/lecture17/build'



We recommend writing separate Python files and invoking them from the command line. Create a new file `tests/mytest.py` and write your local tests there. This is also a common development practice in large projects that involve a Python/C++ FFI.
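The reason a command-line run helps is that each invocation starts a fresh interpreter, which loads whatever library is currently on disk. As a rough sketch of the same idea from Python, the snippet below runs a script in a new process with the standard `subprocess` module. The helper name `run_in_fresh_interpreter` and the throwaway script contents are illustrative only; a real `tests/mytest.py` would import needle instead.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_fresh_interpreter(script_path):
    """Run a script in a new Python process and capture its output."""
    result = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


# Write a throwaway script to a temporary directory and run it.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "mytest.py"
    script.write_text("print(sum([1, 2, 3]))\n")
    returncode, stdout, stderr = run_in_fresh_interpreter(script)
```

A nonzero `returncode` signals that the script failed, and `stderr` then holds the traceback.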

In [23]:
!python tests/mytest.py

python3: can't open file 'tests/mytest.py': [Errno 2] No such file or directory


After building the library, you can fully restart the runtime (factory reset runtime) if you want to bring the updated changes into another Colab notebook. Note that you will need to save your code changes to Google Drive or a private GitHub repo.