# Final Remarks

In this tutorial we saw how a single algorithm was ported to the GPU using two well-known approaches (Numba and CuPy).
Each approach has its strengths and serves the purpose for which it was created. From a developer's perspective, the following metrics are crucial to any development exercise:

   1. **Ease of Programming**: The level of processor architecture knowledge a developer needs before starting to convert the serial code to a GPU version.
   2. **Performance**: A measure of how much effort is required to reach the desired performance on a particular architecture.
   3. **Portability**: To what extent does the same code run on multiple architectures? What provisions does the programming approach offer for targeting different platforms?
   4. **Support**: The overall ecosystem and community support.
    - How easy or difficult is it to profile and debug the application?
    
Let us create high-level buckets for each of the metrics above, limiting the scope to GPU support:

| | |  |  | 
| :--- | :--- | :--- | :--- |
| Ease of Programming | Low: Minimal architecture-specific knowledge needed | Intermediate: Minimal changes expected in code design. Combining these with architecture knowledge helps achieve better performance | High: In-depth GPU architecture knowledge is a must |
| Performance  | Depends: Performance may vary with the complexity and type of application | High: Exposes methods to get good performance. These methods are an integral part of the design and are exposed to the programmer at various granularities | Best: Gives developers full control over parallelism and memory access |
| Portability | Integral: Part of the key objective  | Limited: Works only on a specific platform | | 
| Support | Established: Proven over years and supported by multiple vendors for GPU | Emerging: Gaining traction with multiple vendors for GPU  | |

There is a very thin line between these categories. Within that limited scope and view, we could categorize the different approaches as follows:



 
| Metrics          | Python CuPy | Python Numba | CUDA Languages |
| ---              | ---         | ---          |    ---        |
| **Ease**         | Depends     | Intermediate |    Low        |
| **Performance**  | High        |  High        |    Best       |
| **Portability**  | Integral    |  Integral    |   Limited     |
| **Support**      | Emerging    | Established  |  Established  |

The following points are meant to broaden the user's understanding, as there is no "one-size-fits-all" programming model.

## Ease of Programming
- The Python CuPy programming model is problem-based; that is, the effort depends on the type of task at hand. A major challenge in CuPy is the Raw Kernel, which has to be written in CUDA C. In return, performance can improve and thread IDs become accessible, which is not possible with the other kernel classes in CuPy. The serial code in this lab accesses array elements at non-sequential indices, so solving the task requires access to thread IDs, and the CuPy Raw Kernel was used for that reason. The result is a rewriting effort comparable to that of CUDA C, in particular to map the loops onto a CUDA grid of thread blocks.

- The Python Numba programming model follows the CUDA C programming paradigm with Python semantics. Thus, only moderate effort is required to map the loops onto a grid of thread blocks, since the Numba syntax is Python-friendly.
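
To make the loop-to-grid mapping mentioned in both points concrete, here is a minimal CPU-only sketch in plain Python. It is not the lab code: `launch_config`, `emulate_kernel`, and `index_map` are hypothetical names chosen for illustration, and the nested loops only emulate what the GPU does in parallel. The global thread ID is computed the same way as in CUDA C (`blockIdx.x * blockDim.x + threadIdx.x`), which is also what Numba's `cuda.grid(1)` returns for a 1D launch.

```python
import math


def launch_config(n, threads_per_block=128):
    """Return (blocks_per_grid, threads_per_block) so that every one of
    the n elements is covered by at least one thread."""
    blocks_per_grid = math.ceil(n / threads_per_block)
    return blocks_per_grid, threads_per_block


def emulate_kernel(src, index_map, blocks_per_grid, threads_per_block):
    """CPU emulation of a 1D kernel launch (illustration only).

    Each (block, thread) pair stands for one GPU thread. The thread uses
    its global ID to read src at a non-sequential position given by
    index_map, which is why the thread ID must be accessible.
    """
    n = len(src)
    out = [None] * n
    for block_idx in range(blocks_per_grid):          # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):   # plays the role of threadIdx.x
            tid = block_idx * threads_per_block + thread_idx
            if tid < n:  # guard: the last block may contain surplus threads
                out[tid] = src[index_map[tid]]
    return out


src = [10.0, 20.0, 30.0, 40.0, 50.0]
index_map = [4, 2, 0, 3, 1]  # hypothetical non-sequential access pattern
blocks, threads = launch_config(len(src), threads_per_block=2)
result = emulate_kernel(src, index_map, blocks, threads)
print(blocks, threads, result)
```

In a real kernel the two loops disappear: the body runs once per GPU thread, and only the index computation and the bounds guard remain.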

## Performance
While we have not gone into the details of optimization for either of these programming models, the analysis provided here is based on the general design of the programming model itself.
- Optimization of Python CuPy and Numba code depends on the logic used for data movement, thread block management, and shared memory. However, for the lab task specifically, the CuPy approach is expected to perform better than the Numba approach.

## Portability
We observed the same code being run on both multicore and GPU using CuPy and Numba. The point we highlight here is how a programming model supports the divergent cases where developers may choose a different kernel class or function decorator to get more performance. In a real application, the tolerance for this portability/performance trade-off will vary according to the needs of the programmer and the application.
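
One way to picture this trade-off is a small backend-selection helper. The sketch below is purely illustrative and uses only the Python standard library: `PREFERRED`, `available_backends`, and `pick_backend` are hypothetical names, and the preference order is an assumption rather than a recommendation. Note that a package being importable does not guarantee that a usable GPU is present.

```python
import importlib.util

# Hypothetical preference order: GPU-capable libraries first.
PREFERRED = ("cupy", "numba")


def available_backends(candidates=PREFERRED):
    """Return the candidate packages that can be imported, in preference
    order, followed by a pure-Python fallback that is always present.

    This only checks that the package is installed; it does not check
    that a GPU is actually available at run time.
    """
    found = [name for name in candidates
             if importlib.util.find_spec(name) is not None]
    return found + ["python"]


def pick_backend(available, performance_first=True):
    """Pick the most preferred backend when performance matters,
    otherwise the most portable one (the last entry)."""
    return available[0] if performance_first else available[-1]


backends = available_backends()
fast = pick_backend(backends)
portable = pick_backend(backends, performance_first=False)
print(backends, fast, portable)
```

An application that values portability keeps the fallback path working everywhere, while one that values performance accepts the extra code needed for each tuned backend.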



## Support
- The CuPy and Numba libraries are well documented, and developer support on GitHub is excellent.
- CuPy is available in the RAPIDS package via conda [here](https://rapids.ai/start.html).
- Numba is well supported in the Anaconda distribution.
- **CUDA Python Ecosystem**: NVIDIA has recently shown support for simplifying the developer experience through improved Python code portability and compatibility. The goal is to help unify the Python CUDA ecosystem with a single standard set of low-level interfaces, providing full coverage of and access to the CUDA host APIs from Python. The ecosystem allows interoperability among different accelerated libraries and makes it easy for Python developers to use NVIDIA GPUs. The initial release of CUDA Python includes Cython and Python wrappers for the CUDA Driver and Runtime APIs. You can read more here:
    - [Python Ecosystem](https://developer.nvidia.com/blog/unifying-the-cuda-python-ecosystem/)
    - [CUDA Python Public Preview](https://developer.nvidia.com/cuda-python)
    - [GPU-Accelerated Computing with Python Numba](https://developer.nvidia.com/how-to-cuda-python)


Parallel computing in general is a difficult task: it requires developers not just to know a programming approach but also to think in parallel. While this tutorial provides a good start, it is highly recommended to go through the Profiling and Optimization bootcamps as next steps.

-----

# <div style="text-align: center ;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em">[HOME](../../nways_MD_start_python.ipynb)</div>

-----

# Links and Resources

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

--- 

## Licensing 

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). 