Before we begin, let's run the `nvidia-smi` command in the cell below to display information about the CUDA driver and the GPUs available on the server. To execute the cell, give it focus by clicking on it, then press Ctrl-Enter or the play button in the toolbar above. If all goes well, you should see some output below the grey cell.

In [None]:
!nvidia-smi

## Learning objectives
The **goal** of this lab is to:

- Learn how to run the same code on both a multicore CPU and a GPU using stdpar
- Understand the steps required to parallelize sequential code using stdpar constructs

We do not intend to cover:
- Detailed optimization techniques and mapping of stdpar constructs to CUDA C


# STL
If you are not familiar with the STL (Standard Template Library), this section gives a brief introduction to the parts of it that our code relies on.

The C++ STL is a set of template classes and functions that implement many popular and commonly used algorithms and data structures, such as vectors, lists, queues, and stacks.

At the core of the C++ Standard Template Library are the following three well-structured components:

- Containers: Containers are used to manage collections of objects of a certain kind. There are several different types of containers, such as deque, list, vector, and map.

- Algorithms: Algorithms act on containers. They provide the means by which you will perform initialization, sorting, searching, and transforming of the contents of containers.

- Iterators: Iterators are used to step through the elements of collections of objects. These collections may be containers or subsets of containers.

For the *Pair Calculation* code we will use the ```vector``` container. A ```vector``` is similar to an array, except that it manages its own storage and grows automatically as elements are added. The example below introduces the container and shows how iterators are used to visit its elements.

We will also use the ```std::for_each``` algorithm; its usage is shown in the code below:

```cpp
#include <vector>
#include <algorithm>
#include <iostream>
 
//Using functor
struct Sum
{
    void operator()(int n) { sum += n; }
    int sum{0};
};
 
int main()
{
    std::vector<int> nums{3, 4, 2, 8, 15, 267};
 
    auto print = [](const int& n) { std::cout << " " << n; };
 
    std::cout << "before:";
    std::for_each(nums.cbegin(), nums.cend(), print);
    std::cout << '\n';
 
    std::for_each(nums.begin(), nums.end(), [](int &n){ n++; });
 
    // calls Sum::operator() for each number
    Sum s = std::for_each(nums.begin(), nums.end(), Sum());
 
    std::cout << "after: ";
    std::for_each(nums.cbegin(), nums.cend(), print);
    std::cout << '\n';
    std::cout << "sum: " << s.sum << '\n';
}
```

To learn more about STL you can read and execute sample codes [here](https://www.tutorialspoint.com/cplusplus/cpp_stl_tutorial.htm).


# Parallel STL
Starting with C++17, parallelism has become an integral part of the standard itself. Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in C++17.

The C++17 Parallel Standard Library (stdpar) adds parallel and vectorized execution to the standard algorithms. It is important to note that stdpar is a library and not a language extension.


## std::par Execution Policies


Execution policies define the kind of parallelism applied to a parallel algorithm. Most standard algorithms in the STL accept an execution policy. Each policy has its own unique type, which is used to disambiguate the parallel algorithm overloads. The policies are:

- std::execution::seq = sequential
    - Requires that the algorithm's execution is not parallelized.
- std::execution::par = parallel
    - Indicates that the algorithm's execution may be parallelized.
- std::execution::par_unseq = parallel + vectorized
    - Indicates that the algorithm's execution may be parallelized and vectorized.

Implementations of the execution policies are provided by compiler and standard library vendors. For GPU execution of the parallel policies we will use the NVIDIA HPC C++ compiler.


## Historical Perspective

The C++17 standard changed how _STL_ algorithms are called by adding an execution policy argument:

**C++98:** 
```cpp
std::sort(c.begin(), c.end()); 
```
**C++17:** 
```cpp
std::sort(std::execution::par, c.begin(), c.end());
```

We will be using the NVIDIA HPC C++ compiler, NVC++. It supports C++17, C++ Standard Parallelism (stdpar) for NVIDIA GPUs, OpenACC for multicore CPUs and NVIDIA GPUs, and OpenMP for multicore CPUs. No language extensions or non-standard libraries are required to enable GPU acceleration. All data movement between host memory and GPU device memory is performed implicitly and automatically under the control of CUDA [Unified Memory](../GPU_Architecture_Terminologies.ipynb), which means that heap memory is automatically shared between the CPU (host) and the GPU (device). Stack memory and global memory are not shared, so data accessed by a parallel algorithm must live on the heap. The example below contrasts a heap allocation that works with a stack allocation that does not.

```cpp
std::vector<int> v = ...;
std::sort(std::execution::par, v.begin(), v.end()); // OK, vector allocates on heap

std::array<int, 1024> a = ...;
std::sort(std::execution::par, a.begin(), a.end()); // Fails, array stored on the stack
```

For our code we will use the ```std::for_each``` algorithm with the ```std::execution::par``` execution policy.

**Counting Iterator**: In our code we will also use a special iterator, ```counting_iterator```. It behaves like a pointer into a range of sequentially increasing values, which makes it useful for iterating over a sequence of indices without storing that sequence in memory. Using ```counting_iterator``` therefore saves memory capacity and bandwidth.

Now, let's start modifying the original code to add stdpar. Click on the <b>[rdf.cpp](../../source_code/stdpar/rdf.cpp)</b> and <b>[dcdread.h](../../source_code/stdpar/dcdread.h)</b> links, and modify `rdf.cpp` and `dcdread.h`. Remember to **SAVE** your code after making changes and before running the cells below.


### Compile and Run for Multicore

Now that we have added stdpar code, let us compile it. We will be using the NVIDIA HPC SDK for this exercise. The flags that enable Parallel STL and select its target are:

- `stdpar` : tells the compiler to enable Parallel STL for the selected target
- `stdpar=multicore` : compiles the code for a multicore CPU
- `stdpar` with no target (or `stdpar=gpu`) : compiles the code for an NVIDIA GPU, which is the default target

In [None]:
#Compile the code for multicore
!cd ../../source_code/stdpar && nvc++ -std=c++17 -stdpar=multicore \
-I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include \
-o rdf rdf.cpp -fopenmp \
-L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

Make sure to validate the output by running the executable.

In [None]:
#Run the multicore code
!cd ../../source_code/stdpar && ./rdf && cat Pair_entropy.dat

The output entropy value should be the following:

```
s2 value is -2.43191
s2bond value is -3.87014
```

In [None]:
#profile and see output of nsys
!cd ../../source_code/stdpar && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_stdpar_multicore ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/stdpar/rdf_stdpar_multicore.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/stdpar_multicore.png">


### Compile and Run for NVIDIA GPU

Without changing the code, let us now recompile it for an NVIDIA GPU and rerun it.
GPU acceleration of C++ Parallel Algorithms is enabled with the `-stdpar` command-line option to NVC++. If `-stdpar` is specified, almost all algorithms that use a parallel execution policy are compiled for offloading to run in parallel on an NVIDIA GPU.

 **Understand and analyze** the solution present at:

[RDF Code](../../source_code/stdpar/SOLUTION/rdf.cpp)

[File Reader](../../source_code/stdpar/SOLUTION/dcdread.h)

Open the downloaded files for inspection. 

In [None]:
#compile for Tesla GPU
!cd ../../source_code/stdpar && nvc++ -std=c++17 -DUSE_COUNTING_ITERATOR  -stdpar=gpu -o rdf rdf.cpp 

Make sure to validate the output by running the executable.

In [None]:
#Run on NVIDIA GPU
!cd ../../source_code/stdpar && ./rdf && cat Pair_entropy.dat

The output entropy value should be the following:

```
s2 value is -2.43191
s2bond value is -3.87014
```

In [None]:
#profile and see output of nsys
!cd ../../source_code/stdpar && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_stdpar_gpu ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/stdpar/rdf_stdpar_gpu.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/stdpar_gpu.png">

If you inspect the profiler output more closely, you can see the use of *Unified Memory*, which was explained in previous sections, annotated with a green rectangle.

Moreover, if you compare the NVTX marker `Pair_Calculation` (from the NVTX row) in the multicore and GPU versions, you can see how much improvement you achieved. In the *example screenshot*, that range dropped from 1.52 seconds to 188.4 milliseconds, a speedup of roughly 8x.

Feel free to check out the [solution](../../source_code/stdpar/SOLUTION/rdf.cpp) to help you understand better or to compare your implementation with the sample solution.

# stdpar Analysis

**Usage Scenarios**
- stdpar is part of the C++ standard and provides a good starting point for accelerating code on GPUs and multicore CPUs.

**Limitations/Constraints**
1. It is key to understand that stdpar is not an alternative to CUDA. stdpar provides the highest portability and can be seen as a first step in porting to GPUs, but its general abstraction limits the optimizations available. For example, the stdpar implementation currently depends on Unified Memory. Moreover, there is no control over thread management, which limits the achievable performance improvement.
2. stdpar constructs require C++17, so they may not be usable in legacy codes.


**Which Compilers Support stdpar on GPUs and Multicore?**
1. NVIDIA GPU: As of January 2021, the compiler that supports stdpar on NVIDIA GPUs is the one from NVIDIA.
2. x86 Multicore: GCC provides a multicore CPU implementation that uses Intel TBB as its backend.

## Post-Lab Summary

If you would like to download this lab for later viewing, we recommend that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This ensures that the images are saved as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f nways_files.zip
zip -r nways_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip). Let us now go back to parallelizing our code using other approaches.

**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.

-----

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../../nways_MD_start.ipynb>HOME</a></p>

-----


# Links and Resources
[stdpar Guide](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9770-c++17-parallel-algorithms-for-nvidia-gpus-with-pgi-c++.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)


**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). 