Before we begin, let's run the `nvidia-smi` command in the cell below to display information about the CUDA driver and the GPUs on the server. To do this, give the cell focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!nvidia-smi

## Learning objectives
The **goal** of this lab is to:

- Learn how to run the same code on both a multicore CPU and a GPU using the OpenACC programming model
- Understand the key directives and steps involved in making a sequential code parallel
- Learn how to interpret the compiler feedback
- Learn how to read and understand the Nsight Systems profiler report

We do not intend to cover:
- Optimization techniques in detail


# OpenACC Directives
Using OpenACC directives allows us to parallelize our code without explicitly altering it. This means that, by using OpenACC directives, we can have a single code that functions as both a sequential code and a parallel code.

### OpenACC Syntax

```!$acc <directive> <clauses> ```

**!$acc** in Fortran is what's known as a "compiler hint." Compiler hints are very similar to programmer comments, except that the compiler actually reads them. They are a way for the programmer to "guide" the compiler without running the risk of damaging the code. If the compiler does not understand the hint, it can ignore it rather than throw a syntax error.

**acc** specifies that an OpenACC directive will follow. Any non-OpenACC compiler will ignore it.

**directives** are commands in OpenACC that will tell the compiler to do some action. For now, we will only use directives that allow the compiler to parallelize our code.

**clauses** are additions/alterations to our directives. These include (but are not limited to) optimizations. One way to think about it: directives describe a general action for our compiler to do (such as parallelize our code), and clauses allow the programmer to be more specific (such as how exactly we want the code to be parallelized).
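As a minimal sketch of how a directive and a clause combine, the program below uses the `parallel loop` directive together with a `reduction` clause. The program name, array, and problem size are illustrative assumptions and are not part of the lab code. When compiled without OpenACC support, the `!$acc` line is treated as an ordinary comment and the loop simply runs sequentially.

```fortran
program clause_sketch
  implicit none
  integer, parameter :: n = 1000      ! illustrative problem size
  integer :: i
  double precision :: a(n), total

  do i = 1, n
     a(i) = 1.0d0
  end do

  total = 0.0d0
  ! Directive: parallel loop.
  ! Clause: reduction(+:total), which asks the compiler to combine the
  ! partial sums computed in parallel into a single value of total.
  !$acc parallel loop reduction(+:total)
  do i = 1, n
     total = total + a(i)
  end do

  print *, 'total =', total
end program clause_sketch
```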

## 3 Key Directives

OpenACC consists of 3 key types of directives responsible for **parallel execution**, **managing data movement** and **optimization**.

We will be covering the parallel execution directives in this lab. The data directive is covered in the optional section and can be tried out at the end.

### Parallel and Loop Directives


There are three directives we will cover in this lab: `parallel`, `loop`, and `parallel loop`. Once we understand all three of them, you will be tasked with parallelizing **Pair Calculation** with your preferred directive.

The parallel directive may be the most straightforward of the directives. It marks a region of the code for parallelization (this usually only includes parallelizing a single **do** loop). Let's take a look:

```fortran
!$acc parallel loop
    do i=1,N
        < loop code >
    enddo
```

We may also define a "parallel region". The parallel region may have multiple loops (though this is often not recommended!) 

```fortran
!$acc parallel
    !$acc loop
    do i=1,N
        < loop code >
    enddo
!$acc end parallel
```

`!$acc parallel loop` will mark the next loop for parallelization. It is extremely important to include the `loop`, otherwise you will not be parallelizing the loop properly. The parallel directive tells the compiler to "redundantly parallelize" the code. The `loop` directive specifically tells the compiler that we want the loop parallelized. 

Let's look at an example of why the loop directive is so important. The `parallel` directive tells the compiler to create somewhere to run parallel code. OpenACC calls that somewhere a `gang`, which might be a thread on the CPU, or a CUDA thread block or OpenCL workgroup on a GPU. The compiler chooses how many gangs to create based on where you're running: only a few on a CPU (such as one per CPU core) or many on a GPU (possibly thousands). Gangs allow OpenACC code to scale from small CPUs to large GPUs because each gang works completely independently of every other gang. That's why there's a space between gangs in the images below.

<img src="../images/parallel1f.png" width="80%" height="80%">

---

<img src="../images/parallel2f.png" width="80%" height="80%">

There's a good chance that we don't want our loop to be run redundantly in every gang, though; that seems wasteful and potentially dangerous. Instead, we want to instruct the compiler to break up the iterations of our loop and run them in parallel on the gangs. To do that, we simply add a `loop` directive to the loops of interest. This tells the compiler that we want the loop to be parallelized and promises the compiler that it is safe to do so. Now that we have both `parallel` and `loop`, things look a lot better (and run a lot faster). The compiler is now spreading the loop iterations across all of the gangs, and it is also running multiple iterations of the loop at the same time within each gang as a *vector*. Think of a vector like this: we have 10 numbers that we want to add to 10 other numbers (in pairs). Rather than looking up each pair of numbers, adding them together, storing the result, and then moving on to the next pair in order, modern computer hardware allows us to add all 10 pairs at once, which is a lot more efficient.

<img src="../images/parallel3f.png" width="80%" height="80%">

The `acc parallel loop` directive is both a promise and a request to the compiler. The programmer is promising that the loop can safely be parallelized and is requesting that the compiler do so in a way that makes sense for the target machine. The compiler may make completely different decisions when compiling for a multicore CPU than it would for a GPU, and that's the idea. OpenACC enables programmers to parallelize their codes without having to worry about the details of how best to do so for every possible machine.
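To make the promise and the request concrete, here is a small self-contained sketch. The SAXPY-style loop, the program name, and the array size are illustrative assumptions, not taken from `rdf.f90`. Each iteration touches only its own element `y(i)`, which is what makes the promise of safety true.

```fortran
program saxpy_sketch
  implicit none
  integer, parameter :: n = 100000    ! illustrative problem size
  integer :: i
  real :: a, x(n), y(n)

  a = 2.0
  x = 1.0
  y = 3.0

  ! Promise: the iterations are independent (iteration i only reads
  ! x(i) and updates y(i)).
  ! Request: distribute the iterations in whatever way suits the target.
  !$acc parallel loop
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do

  print *, 'y(1) =', y(1), ' y(n) =', y(n)
end program saxpy_sketch
```

The same source can be built for a multicore CPU or a GPU; only the compiler's target option changes.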



### Atomic Construct

In the code you will also need one more construct to get the right results. The OpenACC atomic construct ensures that a particular variable is accessed and/or updated atomically to prevent indeterminate results and race conditions. In other words, it prevents one thread from stepping on the toes of other threads that access the same variable simultaneously, which would otherwise produce different results from run to run. For example, if we want to count the number of values `r` that fall below a cutoff `cut`, we could write the following:


```fortran
if (r < cut) then
  !$acc atomic
  cnt = cnt + 1
end if
```
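The fragment above can be placed in a complete program along the following lines. This is only a sketch under stated assumptions: the array `r`, its fill pattern, and the program name are hypothetical, and the `copy(cnt)` clause is added here so that the scalar counter is shared among threads instead of being privatized, which is the default treatment of scalars in a `parallel` region.

```fortran
program atomic_sketch
  implicit none
  integer, parameter :: n = 10000     ! illustrative problem size
  real, parameter :: cut = 0.5        ! illustrative cutoff
  integer :: i, cnt
  real :: r(n)

  ! Fill r with a deterministic pattern of values in [0, 1)
  do i = 1, n
     r(i) = real(mod(i, 10)) / 10.0
  end do

  cnt = 0
  ! copy(cnt) makes the counter a shared variable on the device;
  ! the atomic update then protects every increment of it.
  !$acc parallel loop copyin(r) copy(cnt)
  do i = 1, n
     if (r(i) < cut) then
        !$acc atomic update
        cnt = cnt + 1
     end if
  end do

  print *, 'values below cut:', cnt
end program atomic_sketch
```

Without the atomic directive, two threads could read the same old value of `cnt` and both write back the same incremented value, losing one of the counts.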


Now, let's start modifying the original code and add the OpenACC directives. Click on the <b>[rdf.f90](../../source_code/openacc/rdf.f90)</b> link and modify `rdf.f90`. Remember to **SAVE** your code after changes, before running the cells below.

### Compile and Run for Multicore

After adding the OpenACC directives, let us compile the code. When compiling, we will make use of these additional flags:

**-Minfo** : This flag will give us feedback from the compiler about code optimizations and restrictions.

**-Minfo=accel** will only give us feedback regarding our OpenACC parallelizations/optimizations.

**-Minfo=all** will give us all possible feedback, including our parallelization/optimizations, sequential code optimizations, and sequential code restrictions.

**-ta** : This flag allows us to compile our code for a specific target parallel hardware. Without this flag, the code will be compiled for sequential execution.

          -ta=multicore will allow us to compile our code for a multicore CPU.
          
          -ta=tesla will allow us to compile our code for an NVIDIA GPU

In [None]:
#Compile the code for multicore
!cd ../../source_code/openacc && nvfortran -acc -ta=multicore -Minfo=accel -o rdf nvtx.f90 rdf.f90 -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

Let's inspect part of the compiler feedback and see what it's telling us (your compiler feedback will be similar to the output below).

```
	rdf:
     97, Generating Multicore code
         98, !$acc loop gang
     99, Loop carried dependence of g prevents parallelization
         Loop carried backward dependence of g prevents vectorization
```

You can see from line 97 that the compiler is generating multicore code (`97, Generating Multicore code`). It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it.

Let's run the executable and validate the output first. Then, profile the code.

In [None]:
#Run the multicore code and check the output
!cd ../../source_code/openacc && ./rdf && cat Pair_entropy.dat

The output should be the following:

```
s2      :    -2.452690945278331     
s2bond  :    -24.37502820694527    
```

In [None]:
#profile and see output of nvtx
!cd ../../source_code/openacc && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_multicore ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/openacc/rdf_multicore.qdrep) and open it via the GUI. In the timeline view, check out the NVTX markers displayed as part of the threads. **Why are we using NVTX?** Please see the section on [Using NVIDIA Tools Extension (NVTX)](../../../../../profiler/English/jupyter_notebook/profiling.ipynb#Using-NVIDIA-Tools-Extension-(NVTX)).

In the timeline view, right click on the NVTX row and click "Show in Events View". You can now see the NVTX statistics at the bottom of the window, which show the duration of each range.

<img src="../images/nvtx_multicore.jpg" width="100%" height="100%">

You can also check out the NVTX statistics in the terminal console once the profiling session has ended. From the NVTX statistics, you can see that most of the execution time is spent in `Pair_Calculation`. This is a function worth checking out.

You can also compare the NVTX ranges with the serial version (see [screenshot](../serial/rdf_overview.ipynb))

### Compile and Run on a GPU

Without changing the code, now let us try to recompile the code for NVIDIA GPU and rerun. The only difference is now we set **-ta=tesla:managed** instead of **-ta=multicore** . **Understand and analyze** the code present at:

[RDF Code](../../source_code/openacc/rdf.f90)


Open the downloaded files for inspection. Once done, compile the code by running the below cell. View the compiler feedback (enabled by adding `-Minfo=accel` flag) and investigate the compiler feedback for the OpenACC code. The compiler feedback provides useful information about applied optimizations.

In [None]:
#compile for Tesla GPU
!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo  -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

Let's inspect part of the compiler feedback and see what it's telling us (your compiler feedback will be similar to the output below).

```
	rdf:
     97, Generating Tesla code
         98, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         99, !$acc loop seq
     97, Generating implicit copyin(y(iconf,1:natoms),z(iconf,1:natoms),x(iconf,1:natoms)) [if not already present]
         Generating implicit copy(g(:)) [if not already present]
     99, Complex loop carried dependence of g prevents parallelization
         Loop carried dependence of g prevents parallelization
         Loop carried backward dependence of g prevents vectorization 
```

- Using `-ta=tesla:managed`, we instruct the compiler to build for an NVIDIA Tesla GPU using "CUDA Managed Memory".
- Using the `-Minfo` command-line option, we see feedback from the compiler. In this example, we use `-Minfo=accel` to see only the output corresponding to the accelerator (in this case an NVIDIA GPU).
- The line starting with 98 shows that we created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA terminology) and vector parallelism (threads in CUDA terminology), with a vector size of 128 per gang: `98, !$acc loop gang, vector(128) ! blockidx%x threadidx%x`
- The rest of the information concerns data movement. The compiler detected a possible need to move data and handled it for us. We will get into this later in this lab.

It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it. Now, let's profile the code.

In [None]:
#profile and see output of nvtx
!cd ../../source_code/openacc && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_gpu ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/openacc/rdf_gpu.qdrep) and open it via the GUI. In the "timeline view" on the top pane, double click on "CUDA" in the function table on the left and expand it. Zoom in on the timeline and you can see a pattern similar to the screenshot below. The blue boxes are the compute kernels, and each of these groupings of kernels is surrounded by green and red/purple boxes (annotated with purple color) representing data movements.

<img src="../images/parallel_timeline.jpg" width="80%" height="80%">

Hover your mouse over the CUDA row (underlined with blue color in the screenshot below) and expand it until you see both the kernels and memory rows. In the screenshot below you can see the NVTX ranges in the "Events View" at the bottom of the timeline view window. You can right click on each row in the function table on the left (top window), click on "Show in Events View", and check out the details related to that row (similar to the NVTX example in the screenshot below).

<img src="../images/parallel_expand.jpg" width="80%" height="80%">

Nsight Systems captures information about OpenACC execution in the profiled process. In the timeline tree, each thread that uses OpenACC shows the OpenACC trace information. To view this, click on an OpenACC API call to see its correlation with the underlying CUDA API calls. If the OpenACC API call results in GPU work, that will also be highlighted.

<img src="../images/openacc correlation.jpg" width="80%" height="80%">

Moreover, if you hover on a particular OpenACC construct, you can see details about that construct.

<img src="../images/openacc_construct.jpg" width="80%" height="80%">

Feel free to check out the [solution](../../source_code/openacc/SOLUTION/rdf_parallel_directive.f90) to help you understand better.

## OpenACC Analysis

**Usage Scenarios**

There are multiple reasons to use directive-based programming, but from an application developer's point of view the key motivation is that it keeps the code readable and maintainable. Below are some scenarios in which OpenACC can be preferred:
- A legacy code with a sizeable codebase needs to be ported to GPUs with minimal changes to the sequential code.
- Developers want to see whether the code structure favors the GPU SIMD/SIMT style, or, as we say, test the waters before moving a large piece of code to a GPU.
- Portable performance is an important feature of the directive programming approach, and the OpenACC specification has rich features to achieve it on target accelerators like GPUs.

Applications like Ansys Fluent, Gaussian, and VASP make use of OpenACC for adding parallelism. These applications are listed among the top 5 applications that consume the most compute clock cycles on supercomputers worldwide, according to a report by [Intersect 360](http://www.intersect360.com/features-1/new-reports-on-gpu-and-accelerated-computing-from-intersect360).

**Limitations/Constraints**

A directive-based programming model like OpenACC depends on a compiler to understand your sequential code and convert it to CUDA constructs. While OpenACC compilers have evolved over time, they cannot match the best performance that using CUDA C constructs directly can give. Minimizing register pressure and using specialized memory such as texture memory are some examples of what direct CUDA programming makes possible.

It is key to understand that OpenACC is not an alternative to CUDA. In fact, OpenACC can be seen as the first step in GPU porting, with the opportunity to port only the most critical kernels to CUDA. Developers can make use of interoperability techniques for combining OpenACC and CUDA in their codes. For more details you can refer to [Interoperability](https://devblogs.nvidia.com/3-versatile-openacc-interoperability-techniques/).

**Compilers Support for OpenACC**

As of March 2021, here are the compilers that support OpenACC:

| Compiler | Latest Version | Maintained by | Full or Partial Support |
| --- | --- | --- | --- |
| NVIDIA HPC SDK | 21.3 | NVIDIA | Full 2.5 spec |
| GCC | 10 | Mentor Graphics, SUSE | 2.0 spec, limited kernels directive support, no Unified Memory |
| CCE | latest | Cray | 2.0 spec |



## Optional Exercise

### Kernels Directive

The parallel directive leaves a lot of decisions up to the programmer. The programmer decides what is and isn't parallelizable. The programmer also has to provide all of the optimizations; the compiler assumes nothing. If any mistakes happen while parallelizing the code (ignoring data races, etc.), it is up to the programmer to identify and correct them.

Another directive, `kernels`, is the exact opposite in all of these regards. The key difference between the two is as follows:

The **parallel directive** gives a lot of control to the programmer. The programmer decides what to parallelize and how it will be parallelized. Any mistakes made in the parallelization are the fault of the programmer. It is recommended to use a parallel directive for each loop you want to parallelize.

The **kernels directive** leaves the majority of the control to the compiler. The compiler analyzes the loops and decides which ones to parallelize. It may refuse to parallelize certain loops, but the programmer can override this decision. You may use the kernels directive to parallelize large portions of code, and these portions may include multiple loops.
We do not plan to cover this directive in detail in the current lab.

Use the kernels directive and observe any performance difference between the **parallel** and **kernels** directives.
Sample usage of the kernels directive is given as follows:

```fortran
!$acc kernels
    do i=1,N
        < loop code >
    enddo
!$acc end kernels
```
Now, let's start modifying the original code and add the OpenACC kernels directive. From the top menu, click on *File*, and *Open* `rdf.f90` from the `Fortran/source_code/openacc` directory. Remember to **SAVE** your code after changes, before running the cells below.
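As a hedged illustration of a kernels region that spans more than one loop, consider the sketch below. The program name, arrays, and size are assumptions made for this example and do not come from `rdf.f90`. In Fortran the region is closed with `!$acc end kernels`.

```fortran
program kernels_sketch
  implicit none
  integer, parameter :: n = 1000      ! illustrative problem size
  integer :: i
  real :: a(n), b(n)

  !$acc kernels
  ! The compiler analyzes each loop in the region separately and
  ! decides whether, and how, to parallelize it.
  do i = 1, n
     a(i) = real(i)
  end do
  do i = 1, n
     b(i) = 2.0 * a(i)
  end do
  !$acc end kernels

  print *, 'b(n) =', b(n)
end program kernels_sketch
```

Compiling with `-Minfo=accel` reports the decision the compiler made for each of the two loops.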

In [None]:
#compile for Tesla GPU
!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo  -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

Let's run the executable and validate the output first. 

In [None]:
#Run the GPU code and check the output
!cd ../../source_code/openacc && ./rdf && cat Pair_entropy.dat

The output should be the following:

```
s2      :    -2.452690945278331     
s2bond  :    -24.37502820694527 
```

If you only replaced the parallel directive with kernels (meaning only wrapping the loop with `!$acc kernels`), then the compiler feedback will look similar to below:

```
rdf:
     97, Generating implicit copyin(y(iconf,:),z(iconf,:),x(iconf,:)) [if not already present]
         Generating implicit copy(g(:)) [if not already present]
     99, Loop carried dependence due to exposed use of g(:) prevents parallelization
         Accelerator serial kernel generated
         Generating Tesla code
         99, !$acc loop seq
        101, !$acc loop seq
    101, Loop carried dependence due to exposed use of g(:) prevents parallelization
```

The line starting with 99 shows that the compiler generated a serial kernel, so the following loops will run serially. When we use the kernels directive, we let the compiler make decisions for us. In this case, the compiler decided that the loops are not safe to parallelize because of a dependence on `g`.

### OpenACC Independent Clause

In such cases, we need to inform the compiler that the loop is safe to parallelize so that it can generate parallel kernels. To specify that the loop iterations are data independent, we override the compiler's dependency analysis with the `independent` clause (note: this is implied for *parallel loop*).

```fortran
!$acc kernels
    do i=1,N
       !$acc loop independent
       do j=1,N
          < loop code >
       end do
    enddo
!$acc end kernels
```
Now, let's modify the code again and add the `independent` clause to the loops inside the kernels region. From the top menu, click on *File*, and *Open* `rdf.f90` from the `Fortran/source_code/openacc` directory. Remember to **SAVE** your code after changes, before running the cells below.
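The sketch below shows a case where the compiler typically cannot prove independence on its own: the loop writes through an index array. The index array `idx`, the program name, and the sizes are hypothetical and chosen only for illustration. Because `idx` is built as a permutation, no two iterations write the same element, so asserting `independent` is safe here.

```fortran
program independent_sketch
  implicit none
  integer, parameter :: n = 1000      ! illustrative problem size
  integer :: i
  integer :: idx(n)
  real :: a(n)

  ! idx is a permutation (reversed order), so every iteration of the
  ! loop below writes a different element of a.
  do i = 1, n
     idx(i) = n - i + 1
  end do
  a = 0.0

  !$acc kernels
  ! The programmer asserts what the compiler cannot deduce from the
  ! code alone: the iterations do not conflict with each other.
  !$acc loop independent
  do i = 1, n
     a(idx(i)) = real(i)
  end do
  !$acc end kernels

  print *, 'a(1) =', a(1), ' a(n) =', a(n)
end program independent_sketch
```

If `idx` contained repeated values, the assertion would be false and the results could differ from run to run, so the clause should only be used when the programmer is sure.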

In [None]:
#compile for Tesla GPU
!cd ../../source_code/openacc && nvfortran -acc -ta=tesla:managed,lineinfo  -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

Let's inspect the compiler feedback and see if it does what we expect it to do now. You should get a compiler feedback similar to the below:

```
rdf:
     97, Generating implicit copyin(y(iconf,:),z(iconf,:),x(iconf,:)) [if not already present]
         Generating implicit copy(g(:)) [if not already present]
     99, Loop is parallelizable
    101, Loop is parallelizable
         Generating Tesla code
         99, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        101,   ! blockidx%x threadidx%x auto-collapsed
```

We can see that the compiler now knows that the loops are parallelizable (`99, Loop is parallelizable`). Note that the loop is parallelized using `vector(128)`, which means that the compiler generated instructions for chunks of data of length 128 (a vector size of 128 per gang): `99, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x`

Let's profile the code now.

In [None]:
#profile and see output of nvtx
!cd ../../source_code/openacc && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_kernel ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/openacc/rdf_kernel.qdrep) and open it via the GUI. Check out the OpenACC row and hover over the OpenACC constructs to see whether the details look different from when you used the parallel directive. Compare the profiler report with the one from the previous section.

Feel free to check out the [solution](../../source_code/openacc/SOLUTION/rdf_kernel_directive.f90) to help you understand better.

### Data Directive 

In this lab so far, we added OpenACC parallel and loop directives and relied on a feature called [CUDA Managed Memory](../GPU_Architecture_Terminologies.ipynb) to deal with the separate CPU and GPU memories for us. Just by adding OpenACC to our loop we achieved a considerable performance boost. However, managed memory is not compatible with all GPUs or all compilers, and it sometimes performs worse than programmer-defined memory management. Also, depending on the application type, handling data movement between the CPU and GPU explicitly may result in better performance.

Let's inspect the profiler report from the previous section when we used managed memory with parallel directives. From the "timeline view" on the top pane, double click on the "CUDA" from the function table on the left and expand it. Zoom in on the timeline and you can see a pattern similar to the screenshot below. The blue boxes are the compute kernels and each of these groupings of kernels is surrounded by purple and teal boxes (annotated with green color) representing data movements.

<img src="../images/parallel_unified.jpg">

What this timeline shows us is that data moves between the GPU and CPU at the start and end of the pair calculation. The compiler feedback we collected earlier tells us quite a bit about data movement too. If we look again at the compiler feedback from earlier, we see the following.

```
rdf:
     97, Generating implicit copyin(y(iconf,:),z(iconf,:),x(iconf,:)) [if not already present]
         Generating implicit copy(g(:)) [if not already present]
     99, Loop is parallelizable
    101, Loop is parallelizable
         Generating Tesla code
         99, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        101,   ! blockidx%x threadidx%x auto-collapsed
```

The compiler feedback is telling us that the compiler has inserted data movement around our parallel region at line 97, which copies the `g` array in and out of the GPU memory and also copies `x`, `y` and `z` to the GPU memory. The important part to note here is the word *implicit*, which means that we have not explicitly added the data clauses but the compiler can see that the data needs to be copied.

The compiler can only work with the information we provide. It knows we need all those arrays on the GPU for the accelerated section within the `pair_gpu` function, but we didn't tell the compiler anything about what happens to the data outside of those sections. Without this knowledge, the compiler has to copy the full arrays to the GPU and back to the CPU for each accelerated section. This is a lot of unnecessary data transfer.

Ideally, we would want to move the data to the GPU at the beginning, and only transfer it back to the CPU at the end (if needed). If we do not need to copy any data back to the CPU, then we only need to create space on the device (GPU) for an array.

We need to give the compiler information about how to reduce the unnecessary data movement. By adding the OpenACC `data` directive to a structured code block, we tell the compiler how to manage data according to the clauses. The following sections explain how to use data clauses in your program. For more information on the data directive clauses, please visit the [OpenACC 3.0 Specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf).
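As a rough sketch of the idea, the program below wraps two accelerated loops in a single structured `data` region so that the array is transferred to the device once at the start and back once at the end. The array `a`, the program name, and the size are illustrative assumptions and are not part of `rdf.f90`.

```fortran
program data_region_sketch
  implicit none
  integer, parameter :: n = 1000      ! illustrative problem size
  integer :: i
  real :: a(n)

  a = 1.0

  ! One transfer to the device here, one transfer back at "end data".
  !$acc data copy(a)

  ! Both loops find a already present on the device, so no extra
  ! transfers are generated around them.
  !$acc parallel loop
  do i = 1, n
     a(i) = a(i) * 2.0
  end do

  !$acc parallel loop
  do i = 1, n
     a(i) = a(i) + 1.0
  end do

  !$acc end data

  print *, 'a(1) =', a(1)
end program data_region_sketch
```

Without the data region and without managed memory, each `parallel loop` would get its own implicit copy in and copy out of `a`.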


**Using OpenACC Data Clauses**

Data clauses allow the programmer to specify data transfers between the host and device (or in our case, the CPU and the GPU). Let's look at an example where we do not use a data clause.

```fortran
allocate(A(N))

  !$acc parallel loop
  do i=1,N
    A(i) = 0
  enddo
```

We have allocated an array A outside of our parallel region. This means that A is allocated in the CPU memory. However, we access A inside of our loop, and that loop is contained within a parallel region. Within that parallel region, A(i) is attempting to access a memory location within the GPU memory. We didn't explicitly allocate A on the GPU, so one of two things will happen.

1. The compiler will understand what we are trying to do, and automatically copy A from the CPU to the GPU.
2. The program will check for an array A in GPU memory, it won't find it, and it will throw an error.

Instead of hoping that we have a compiler that can figure this out, we could instead use a data clause.

```fortran
allocate(A(N))

  !$acc parallel loop copy(A(1:N))
  do i=1,N
    A(i) = 0
  enddo
```
The image below offers a step-by-step example of using the copy clause.

<img src="../images/openacc_copyclause.png" width="80%" height="80%">

Of course, we might not want to copy our data both to and from the GPU memory. Maybe we only need the array's values as inputs to the GPU region, or maybe it's only the final results we care about, or perhaps the array is only used temporarily on the GPU and we don't want to copy it in either direction. The following OpenACC data clauses provide a bit more control than just the `copy` clause.

* `copyin` - Create space for the array and copy the input values of the array to the device. At the end of the region, the array is deleted without copying anything back to the host.
* `copyout` - Create space for the array on the device, but don't initialize it to anything. At the end of the region, copy the results back and then delete the device array.
* `create` - Create space for the array on the device, but do not copy anything to the device at the beginning of the region, nor back to the host at the end. The array will be deleted from the device at the end of the region.
* `present` - Don't do anything with these variables. I've put them on the device somewhere else, so just assume they're available.

You may also use them to operate on multiple arrays at once, by including those arrays as a comma separated list.

```fortran
!$acc parallel loop copy( A(1:N), B(1:M), C(1:Q) )
```

You may also use more than one data clause at a time.

```fortran
!$acc parallel loop create( A(1:N) ) copyin( B(1:M) ) copyout( C(1:Q) )
```
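A minimal, self-contained sketch of choosing a different clause for each array might look like this. The arrays `b`, `tmp`, and `c`, the program name, and the size are hypothetical names introduced for this example only.

```fortran
program data_clause_sketch
  implicit none
  integer, parameter :: n = 1000      ! illustrative problem size
  integer :: i
  real :: b(n), tmp(n), c(n)

  b = 3.0

  ! b:   input only           -> copyin  (host to device, nothing copied back)
  ! tmp: device scratch space -> create  (no transfers in either direction)
  ! c:   output only          -> copyout (device to host at the end)
  !$acc parallel loop copyin(b) create(tmp) copyout(c)
  do i = 1, n
     tmp(i) = b(i) * b(i)
     c(i) = tmp(i) + 1.0
  end do

  print *, 'c(1) =', c(1)
end program data_clause_sketch
```

Compared with using `copy` for all three arrays, this avoids three of the six transfers: `b` is never copied back, `c` is never copied in, and `tmp` is never transferred at all.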

Let us try adding a data clause to our code and observe any performance difference between the two versions.
**Note: We have removed the `managed` option in order to handle data management explicitly.**

From the top menu, click on *File*, and *Open* `rdf.f90` from the `Fortran/source_code/openacc` directory. Remember to **SAVE** your code after changes, before running the cells below.

In [None]:
#compile for Tesla GPU without managed memory
!cd ../../source_code/openacc && nvfortran -acc -ta=tesla,lineinfo -Minfo=accel -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt

Let us inspect the compiler feedback and see whether the compiler applied the optimizations. Below is the expected compiler feedback after adding the `data` directives.

```
rdf:
     95, Generating copy(g(:)) [if not already present]
         Generating copyin(y(y$sd8:(y$sd8-1)+y$sd8,y$sd8:(y$sd8-1)+y$sd8),z(z$sd7:(z$sd7-1)+z$sd7,z$sd7:(z$sd7-1)+z$sd7),x(x$sd9:(x$sd9-1)+x$sd9,x$sd9:(x$sd9-1)+x$sd9)) [if not already present]
     98, Generating Tesla code
         99, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        100, !$acc loop seq
    100, Loop carried dependence of g prevents parallelization
         Loop carried backward dependence of g prevents vectorization
```

You can see that on line 95 the compiler now generates `copy` for the `g` array and `copyin` for the `x`, `y`, and `z` arrays. The `[if not already present]` note means that the data is copied to the GPU only if it is not already there. Another key observation is that the word *implicit* has disappeared, because we have added the data clauses ourselves. The data sizes are calculated automatically by the compiler here; we can also give them to the compiler explicitly if needed.


Make sure to validate the output by running the executable.

In [None]:
#Run the GPU code and check the output
!cd ../../source_code/openacc && ./rdf && cat Pair_entropy.dat

The output should be the following:

```
s2      :    -2.452690945278331     
s2bond  :    -24.37502820694527 
```

In [None]:
#profile and see output without managed memory
!cd ../../source_code/openacc && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_no_managed ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/openacc/rdf_no_managed.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/parallel_data.jpg">

Have a look at the data movements annotated with green color and compare it with the previous versions. We have accelerated the application and reduced the execution time by eliminating the unnecessary data transfers between CPU and GPU.

Feel free to check out the [solution](../../source_code/openacc/SOLUTION/rdf_data_directive.f90) to help you understand better.

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f nways_files.zip
zip -r nways_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip). Let us now go back to parallelizing our code using other approaches.

**IMPORTANT**: Please click on the **HOME** button to go back to the main notebook for *N ways of GPU programming for MD* code.

-----

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../../nways_MD_start.ipynb>HOME</a></p>

-----



# Links and Resources
[OpenACC API guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). 