# Numba Acceleration 

Before we begin, run the cell below to display information about the NVIDIA® CUDA® driver and the GPUs on the server with the `nvidia-smi` command. To execute the cell, click on it and press Ctrl+Enter, or press the play button in the toolbar above. The output appears below the grey cell.

In [None]:
!nvidia-smi

## Copy the Serial code

Before you start modifying the serial code, copy it to the numba folder by running the cell below.

In [None]:
!cp ../source_code/serial/* ../source_code/numba

## Run the Serial code

In [None]:
%run ../source_code/numba/cfd.py 64 500

---

# Start Adding Numba Constructs

Now, you can start modifying the Python code: 

[cfd.py](../source_code/numba/cfd.py)

Remember to **SAVE** your code after making changes and before running the cells below.

#### Some Hints
The serial code consists of the `main`, `jacobi`, and `write_data` functions. Focus on the `jacobi` and `main` functions. Remember to import the Numba CUDA module with `from numba import cuda` at the top of your code. Check whether your code contains any data races.

## Run and Profile the Numba code

In [None]:
!cd ../source_code/numba && nsys profile -t nvtx --stats=true --force-overwrite true -o minicfdnumba_profile python3 cfd.py 64 500

You can examine the output in the terminal, or you can download the report file and view the timeline by opening it with NVIDIA Nsight Systems.

Download and save the report file by holding down <mark>Shift</mark> and <mark>right-clicking</mark> [here](../source_code/numba/minicfdnumba_profile.nsys-rep), then choosing <mark>Save Link As</mark>. Once done, open it via the GUI.


## Validating the Output

Make sure the error value printed as output matches that of the serial code.
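One way to make this check explicit is to compare the two printed error values with a relative tolerance, since CPU and GPU floating-point results may differ in the last few digits. The sketch below is only an illustration: the helper name `errors_match`, the tolerance, and the two placeholder numbers are assumptions, and the real values have to be copied from the output of your own runs.

```python
import math


def errors_match(serial_error, gpu_error, rel_tol=1e-6):
    # True when the two error values agree to within the relative tolerance.
    return math.isclose(serial_error, gpu_error, rel_tol=rel_tol)


# Placeholder values only: replace them with the numbers printed by your runs.
serial_error = 1.0e-3
gpu_error = 1.0e-3

print("match" if errors_match(serial_error, gpu_error) else "MISMATCH")
```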


# Recommendations for adding Numba Constructs

After finding the hotspot function, take an incremental approach:

1) Add the `@cuda.jit` decorator at the top of the function and rewrite the function body as a CUDA kernel (this is rather tedious).

2) You may need to perform the copy or swap of data in a separate kernel function.

3) Ignore the I/O function.

4) Ensure that only the required data moves from `host (CPU function)` to `device (GPU function)` and vice versa.

5) Cross-check the output after each incremental change to verify correctness and algorithmic scalability.

6) Start with a small problem size that reduces the execution time.


**General tip:** Be aware of *data races*: situations in which at least two threads access a shared variable at the same time and at least one of them tries to modify it. If a data race occurs, an incorrect result may be returned, so make sure to validate your output against the serial version.


# Links and Resources

[Introduction to Numba](https://github.com/gpuhackathons-org/gpubootcamp/tree/master/hpc/nways/nways_labs/nways_MD/English/Python/jupyter_notebook/numba/numba_guide.ipynb)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.


---
## Licensing 

Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.