# Final Remarks

In this tutorial, the same algorithm was ported to the GPU using several popular methods. Each method has its strengths and serves the purpose for which it was created. From a developer's point of view, the following parameters are crucial to any development exercise:

- **Ease of Programming**: How much in-depth knowledge of the processor architecture does a developer need before starting to port the code to the GPU?
- **Performance**: How much effort is required to reach the desired performance on a particular architecture?
- **Portability**: To what extent does the same code run on multiple architectures? What provisions does the programming approach offer to target different platforms?
- **Support**: The overall ecosystem and community support.
    - Which compilers implement the standard?
    - Which languages are supported?
    - Which applications make use of it?
    - How easy or difficult is it to profile and debug the application?
    
Let us define high-level buckets for each of these parameters, limiting the scope to GPU support:

| Parameter | | | |
| :--- | :--- | :--- | :--- |
| Ease of Programming | High: Minimal architecture-specific knowledge needed | Intermediate: Minimal changes expected in code design. Combining these with architecture knowledge helps achieve better performance | Low: In-depth GPU architecture knowledge is a must |
| Performance | Depends: Performance may vary with the complexity and type of application | High: Exposes methods to get good performance. These methods are an integral part of the design and are exposed to the programmer at various granularities | Best: Developers have full control over parallelism and memory access |
| Portability | Integral: Part of the key objective | Limited: Works only on a specific platform | |
| Support | Established: Proven over years and supported by multiple vendors for GPU | Emerging: Gaining traction with multiple vendors for GPU | |

The lines between these categories are thin, but within this limited scope the different approaches can be categorized as follows:

 
| | OpenACC | OpenMP | DO-CONCURRENT | Kokkos | CUDA Languages |
| --- | --- | --- | --- | --- | --- |
| Ease | High  | High | High  | Intermediate | Low |
| Performance  | Depends | Depends | Depends | High | Best |
| Portability | Integral  | Integral | Integral | Integral | Limited |
| Support | Established | Emerging | Emerging | Established | Established |

No single programming model fits all needs, so the points below are meant to help users choose.

## Ease of Programming
- The directive-based OpenMP and OpenACC programming models are generally the least intrusive when applied to loops.
- CUDA required more rewriting effort than the directive-based models, in particular to map the loops onto a CUDA grid of threads and thread blocks.
- DO-CONCURRENT also required only a minimal change: replacing the *do* loop with *do concurrent*.
- The overhead in terms of lines of code is smallest for OpenMP, OpenACC and DO-CONCURRENT.
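
To make the "least intrusive" point concrete, here is a minimal sketch in C. The `saxpy_openacc` and `saxpy_openmp` functions are hypothetical illustrations and not code from this tutorial; the data clauses shown are one reasonable choice, assuming the arrays live in host memory. In both variants the loop body is unchanged and only a directive is added above it.

```c
/* Hypothetical kernel for illustration (not the tutorial's code):
 * computes y[i] = a * x[i] + y[i] for i in [0, n). */

/* OpenACC: a single directive is placed above the otherwise unchanged loop.
 * The data clauses describe which arrays move to and from the device. */
void saxpy_openacc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* OpenMP target offload: same loop body, different directive spelling. */
void saxpy_openmp(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

If the code is built without OpenACC or OpenMP enabled, the compiler ignores the pragmas and both functions run as ordinary serial loops, which is part of why the directive-based approach adds so few lines.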

## Performance
We have not gone into the details of optimization for any of these programming models, so the analysis provided here is based on the general design of each programming model.

- The abstract models of OpenACC and OpenMP define a least common denominator for accelerator devices, but they cannot represent the architectural specifics of these devices without making the language less portable.
- DO-CONCURRENT, on the other hand, is more abstract and gives developers less control to optimize the code.

## Portability
We observed the same code running on both multicore CPUs and GPUs using OpenMP, OpenACC and DO-CONCURRENT. The point we highlight here is how a programming model supports divergent cases in which developers choose different directive variants to get more performance. In a real application, the tolerance for this portability/performance trade-off will vary according to the needs of the programmer and the application.
- OpenMP supports the [metadirective](https://www.openmp.org/spec-html/5.0/openmpsu28.html), which lets the developer activate different directive variants based on a selected condition.
- In OpenACC, when the ```kernels``` construct is used, the compiler is responsible for mapping and partitioning the program onto the underlying hardware. Since the compiler takes care of most parallelization issues, this descriptive approach may generate performant code for a specific architecture. The downside is that the quality of the generated accelerated code depends significantly on the capability of the compiler used, hence the term "may".
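
The two mechanisms above can be sketched in C as follows. This is an illustrative sketch, not code from this tutorial: the function names are hypothetical, the `arch("nvptx")` selector is modelled on the style used in the OpenMP 5.0 examples, and compiler support for the metadirective varies, so treat it as an assumption to verify with your toolchain. The `restrict` qualifiers in the OpenACC variant are an assumption that the arrays do not overlap, which helps the compiler prove the loop is safe to parallelize.

```c
/* Hypothetical element-wise product, used only for illustration. */

/* OpenMP 5.0 metadirective: the directive variant is selected from the
 * compilation context. On an NVPTX device the loop uses "teams loop";
 * otherwise it falls back to "parallel loop". */
void vmul_metadirective(int n, const double *v1, const double *v2, double *v3)
{
    #pragma omp target map(to: v1[0:n], v2[0:n]) map(from: v3[0:n])
    #pragma omp metadirective \
        when(device={arch("nvptx")}: teams loop) \
        default(parallel loop)
    for (int i = 0; i < n; ++i) {
        v3[i] = v1[i] * v2[i];
    }
}

/* OpenACC kernels construct: the region is only described; the compiler
 * decides whether and how to parallelize the loop for the target. */
void vmul_kernels(int n, const double *restrict v1,
                  const double *restrict v2, double *restrict v3)
{
    #pragma acc kernels copyin(v1[0:n], v2[0:n]) copyout(v3[0:n])
    for (int i = 0; i < n; ++i) {
        v3[i] = v1[i] * v2[i];
    }
}
```

In the first function the programmer prescribes a variant per platform; in the second the decision is left to the compiler, which is why the resulting performance depends on compiler capability.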


## Support
- OpenACC implementations are available in popular compilers such as NVIDIA HPC SDK (formerly PGI), GCC, Clang and Cray.
- OpenMP GPU support is currently available in a limited number of compilers, but because OpenMP is the most widely supported programming model for multicore, it is a matter of time before its GPU support is on par with other models.
- DO-CONCURRENT is part of the ISO Fortran standard and is bound to become an integral part of most compilers that support parallelism.


Parallel computing has in general been a difficult task: it requires developers not just to know a programming approach but also to think in parallel. While this tutorial provides a good start, it is highly recommended to go through the Profiling and Optimization bootcamps as next steps.

-----

# <div style="text-align: center ;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em">[HOME](../../nways_MD_start.ipynb)</div>

-----

# Links and Resources
[OpenACC API guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).