Before we begin, run the cell below to display information about the CUDA driver and the GPUs on the server using the `nvidia-smi` command. To execute a cell, give it focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output below the grey cell.

In [None]:
!nvidia-smi

## Learning objectives
The **goal** of this lab is to:

- Learn how to run the same code on both a multicore CPU and a GPU using Fortran Standard Parallelism
- Understand the steps required to make sequential code parallel using do-concurrent constructs

We do not intend to cover:
- Detailed optimization techniques and mapping of do-concurrent constructs to CUDA Fortran


# Fortran Standard Parallelism

ISO Standard Fortran 2008 introduced the DO CONCURRENT construct to allow you to express loop-level parallelism, one of the various mechanisms for expressing parallelism directly in the Fortran language. 

Fortran developers have been able to accelerate their programs using CUDA Fortran, OpenACC, or OpenMP. Now that the NVIDIA HPC SDK supports DO CONCURRENT on GPUs, the compiler automatically accelerates loops written with DO CONCURRENT, so developers can get the benefit of accelerating on NVIDIA GPUs using ISO Standard Fortran without any extensions, directives, or non-standard libraries. You can now write standard Fortran, remaining fully portable to other compilers and systems, and still benefit from the full power of NVIDIA GPUs.

To parallelize the *Pair Calculation* in our code, all that's required is to express its loops with DO CONCURRENT. The example below introduces the DO CONCURRENT syntax.

A sample vector addition is shown in the code below:

```fortran
  subroutine vec_addition(x,y,n)
    real :: x(:), y(:)
    integer :: n, i  
    do i = 1, n 
      y(i) = x(i)+y(i)
    enddo  
  end subroutine vec_addition
```

To make use of ISO Fortran DO CONCURRENT, we replace the `do` loop with `do concurrent`, as shown in the code below:

```fortran
  subroutine vec_addition(x,y,n)
    real :: x(:), y(:)
    integer :: n, i  
    do concurrent (i = 1: n) 
      y(i) = x(i)+y(i)
    enddo  
  end subroutine vec_addition
```

By changing the DO loop to DO CONCURRENT, you are telling the compiler that there are no data dependencies between the n loop iterations. This leaves the compiler free to generate code that executes the iterations in any order, or simultaneously. The compiler parallelizes the loop even if there are data dependencies, which results in race conditions and likely incorrect results. It's your responsibility to ensure that the loop is safe to parallelize.
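
To make the distinction concrete, here is a minimal, self-contained sketch that contrasts a loop that is safe for DO CONCURRENT with one that is not. The program name, the array `prefix`, and the problem size are illustrative assumptions and are not taken from `rdf.f90`.

```fortran
program dependency_demo
  implicit none
  integer, parameter :: n = 8          ! illustrative size (assumption)
  real :: x(n), y(n), prefix(n)
  integer :: i

  x = [(real(i), i = 1, n)]
  y = 1.0

  ! Safe: iteration i reads and writes only element i, so the
  ! iterations are independent and may run in any order.
  do concurrent (i = 1:n)
    y(i) = x(i) + y(i)
  end do

  ! NOT safe for DO CONCURRENT: iteration i reads prefix(i-1), which is
  ! written by iteration i-1 (a loop-carried dependency), so this loop
  ! is kept as an ordinary sequential DO loop.
  prefix(1) = x(1)
  do i = 2, n
    prefix(i) = prefix(i-1) + x(i)
  end do

  ! Self-checks: stop with an error if either result is wrong.
  if (any(y /= x + 1.0)) error stop "vector addition mismatch"
  if (prefix(n) /= real(n*(n+1)/2)) error stop "prefix sum mismatch"

  print *, "y      =", y
  print *, "prefix =", prefix
end program dependency_demo
```

A quick test when reviewing a loop: if any iteration reads a value that a different iteration writes, the loop is not a candidate for DO CONCURRENT as written.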

## Nested Loop Parallelism

Nested loops are a common code pattern encountered in HPC applications. A simple example might look like the following:

```fortran
  do i=2, n-1
    do j=2, m-1
      a(i,j) = w0 * b(i,j) 
    enddo
  enddo
```

It is straightforward to write such patterns with a single DO CONCURRENT statement, as in the following example. It is easier to read, and the compiler has more information available for optimization.

```fortran
  do concurrent(i=2 : n-1, j=2 : m-1)
    a(i,j) = w0 * b(i,j) 
  enddo
```
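
As a sanity check, the following self-contained sketch runs both forms on small arrays and confirms that they produce the same result. The array sizes, the value of `w0`, and the names `a_seq` and `a_par` are assumptions chosen for illustration.

```fortran
program nested_demo
  implicit none
  integer, parameter :: n = 6, m = 5   ! illustrative sizes (assumption)
  real, parameter :: w0 = 0.5          ! illustrative weight (assumption)
  real :: a_seq(n,m), a_par(n,m), b(n,m)
  integer :: i, j

  ! Fill b with distinct values and clear both result arrays.
  do j = 1, m
    do i = 1, n
      b(i,j) = real(i + 10*j)
    end do
  end do
  a_seq = 0.0
  a_par = 0.0

  ! Reference: explicit nested DO loops over the interior points.
  do i = 2, n-1
    do j = 2, m-1
      a_seq(i,j) = w0 * b(i,j)
    end do
  end do

  ! Same iteration space expressed as a single DO CONCURRENT statement.
  do concurrent (i = 2:n-1, j = 2:m-1)
    a_par(i,j) = w0 * b(i,j)
  end do

  ! Both forms must agree element by element.
  if (any(a_seq /= a_par)) error stop "nested and concurrent results differ"
  print *, "results match; boundary untouched:", all(a_par(1,:) == 0.0)
end program nested_demo
```

Each `(i,j)` pair writes a distinct element of the result array and only reads `b`, which is why the combined loop is safe to run concurrently.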

Now, let's start modifying the original code and add DO-CONCURRENT. Click on the <b>[rdf.f90](../../source_code/doconcurrent/rdf.f90)</b> link and modify `rdf.f90`. Remember to **SAVE** your code after making changes and before running the cells below.

### Compile and Run for Multicore

Now that we have added DO-CONCURRENT to the code, let us try to compile it. We will be using the NVIDIA HPC SDK for this exercise. The flags used for enabling DO-CONCURRENT are as follows:

- `-stdpar`: This flag tells the compiler to parallelize DO-CONCURRENT loops for the selected target
- `-stdpar=multicore` compiles the code for a multicore CPU
- `-stdpar` or `-stdpar=gpu` compiles the code for an NVIDIA GPU (the GPU is the default target)

In [None]:
#Compile the code for multicore
!cd ../../source_code/doconcurrent && nvfortran -stdpar=multicore -Minfo -o rdf nvtx.f90 rdf.f90 -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

If you compile this code with **-Minfo**, you can see how the compiler performs the parallelization.
```
rdf:
     80, Memory zero idiom, loop replaced by call to __c_mzero8
     92, Generating Multicore code
         92, Loop parallelized across CPU threads
```

Make sure to validate the output by running the executable and checking the results.

In [None]:
#Run the multicore code
!cd ../../source_code/doconcurrent && ./rdf && cat Pair_entropy.dat

The output entropy value should be the following:

```
s2      :    -2.452690945278331     
s2bond  :    -24.37502820694527    
```
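
When comparing your own run against these reference values, a small relative tolerance is usually more robust than an exact match, because parallel execution can reorder floating-point operations and change the last few digits. The sketch below shows one way to do such a check. The tolerance `rel_tol` and the helper `close_enough` are hypothetical, and the computed values are placeholders that you would replace with the numbers printed from `Pair_entropy.dat`.

```fortran
program check_entropy
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Reference values quoted in this lab.
  real(dp), parameter :: s2_ref     = -2.452690945278331_dp
  real(dp), parameter :: s2bond_ref = -24.37502820694527_dp
  ! Assumed tolerance; adjust to the accuracy you need.
  real(dp), parameter :: rel_tol = 1.0e-10_dp
  real(dp) :: s2, s2bond

  ! Placeholders: replace with the values from your own run.
  s2     = s2_ref
  s2bond = s2bond_ref

  if (.not. close_enough(s2, s2_ref)) error stop "s2 differs from reference"
  if (.not. close_enough(s2bond, s2bond_ref)) error stop "s2bond differs from reference"
  print *, "entropy values match the reference within tolerance"

contains

  ! Hypothetical helper: true if val is within rel_tol of ref (relative).
  pure logical function close_enough(val, ref)
    real(dp), intent(in) :: val, ref
    close_enough = abs(val - ref) <= rel_tol * abs(ref)
  end function close_enough

end program check_entropy
```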

In [None]:
#profile and see output of nsys
!cd ../../source_code/doconcurrent && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_doconcurrent_multicore ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/doconcurrent/rdf_doconcurrent_multicore.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/do_concurrent_multicore.jpg">


### Compile and Run for NVIDIA GPU

Without changing the code, let us now recompile it for an NVIDIA GPU and rerun.
GPU acceleration of DO-CONCURRENT is enabled with the `-stdpar` command-line option to nvfortran. If `-stdpar` is specified, almost all loops with DO-CONCURRENT are compiled for offloading to run in parallel on an NVIDIA GPU.

 **Understand and analyze** the solution present at:

[RDF Code](../../source_code/doconcurrent/SOLUTION/rdf.f90)

Open the downloaded files for inspection. 

In [None]:
#compile for Tesla GPU
!cd ../../source_code/doconcurrent && nvfortran -stdpar=gpu -Minfo -acc -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

If you compile this code with -Minfo, you can see how the compiler performs the parallelization.
```
rdf:
     80, Memory zero idiom, loop replaced by call to __c_mzero8
     92, Generating Tesla code
         92, Loop parallelized across CUDA thread blocks, CUDA threads(4) blockidx%y threadidx%y
             Loop parallelized across CUDA thread blocks, CUDA threads(32) ! blockidx%x threadidx%x
```

Make sure to validate the output by running the executable and checking the results.

In [None]:
#Run on NVIDIA GPU
!cd ../../source_code/doconcurrent && ./rdf && cat Pair_entropy.dat

The output entropy value should be the following:

```
s2      :    -2.452690945278331     
s2bond  :    -24.37502820694527 
```

In [None]:
#profile and see output of nsys
!cd ../../source_code/doconcurrent && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_doconcurrent_gpu ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/doconcurrent/rdf_doconcurrent_gpu.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/do_concurrent_gpu.jpg">

If you inspect the profiler output more closely, you can see the usage of *Unified Memory*, annotated with a green rectangle, which was explained in previous sections.

Moreover, if you compare the NVTX marker `Pair_Calculation` (from the NVTX row) in the multicore and GPU versions, you can see how much improvement you achieved. In the *example screenshot*, we were able to reduce that range from 1.57 seconds to 26 milliseconds, roughly a 60x reduction.

Feel free to check out the [solution](../../source_code/doconcurrent/SOLUTION/rdf.f90) to help you understand better or to compare your implementation with the sample solution.

# ISO Standard Fortran Analysis

**Usage Scenarios**
- DO-CONCURRENT is part of the standard language and provides a good starting point for accelerating code on GPUs and multicore CPUs.

**Limitations/Constraints**
It is key to understand that DO-CONCURRENT is not an alternative to CUDA. *DO-CONCURRENT* provides the highest portability and can be seen as the first step in porting to GPUs. Its general abstraction limits the available optimizations. For example, the DO-CONCURRENT implementation currently depends on Unified Memory. Moreover, you have no control over thread management, which limits the achievable performance improvement.


**Which Compilers Support DO-CONCURRENT on GPUs and Multicore?**
1. NVIDIA GPU: As of Jan 2021, the only compiler that supports DO-CONCURRENT on NVIDIA GPUs is from NVIDIA.
2. x86 Multicore: Other compilers, such as the Intel compiler, have an implementation for multicore CPUs.

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f nways_files.zip
zip -r nways_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip). Let us now go back to parallelizing our code using other approaches.

**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.

-----

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../../nways_MD_start.ipynb>HOME</a></p>

-----


# Links and Resources
[Do-Concurrent Guide](https://developer.nvidia.com/blog/accelerating-fortran-do-concurrent-with-gpus-and-the-nvidia-hpc-sdk/)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)


**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). 