# OpenACC Acceleration 
Let's display information about the GPUs on the server by running the `nvidia-smi` command, which ships with the NVIDIA GPU drivers that we will be using. To do this, give the cell below focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [19]:
!nvidia-smi

Wed May 18 10:55:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                   On |
| N/A   25C    P0    48W / 400W |     20MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------

Since the code will also be run on multicore CPUs, run the cell below to get details of the number of cores and the CPU architecture of the system.

In [None]:
!cat /proc/cpuinfo

## Copy and Compile the Serial code

Before you start modifying the serial code, let's copy it into the `openacc` directory so that the original stays untouched.

In [20]:
!cp ../source_code/serial/* ../source_code/openacc

In [21]:
!cd ../source_code/openacc && make clean && make

rm -f arraymalloc.o boundary.o cfd.o cfdio.o jacobi.o cfd velocity.dat colourmap.dat cfd.plt core
nvc++ -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c arraymalloc.cpp
nvc++ -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c boundary.cpp
nvc++ -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c cfd.cpp
nvc++ -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c cfdio.cpp
    int nvel, nrgb;
        ^

    int nvel, nrgb;
              ^

nvc++ -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c jacobi.cpp
nvc++ -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/n

## Run the Serial code

In [22]:
!cd ../source_code/openacc && ./cfd 64 500

Scale Factor = 64, iterations = 500
Irrotational flow
Running CFD on 2048 x 2048 grid in serial

Starting main loop...


... finished
After 500 iterations, the error is 0.00211211
Time for 500 iterations was 7.76763 seconds
Each iteration took 0.0155353 seconds


Writing data files ...
... done!

Written gnuplot script 'cfd.plt'
... finished


---

# Start adding OpenACC Pragmas

Now, you can start modifying the C++ code and the `Makefile`:

[cfd code](../source_code/openacc/cfd.cpp) 

[Makefile](../source_code/openacc/Makefile)

Remember to **SAVE** your code after making changes and before running the cells below.

#### Some Hints

1) Notice the implicit and explicit copies of variables --> add the `-Minfo=accel` flag to the `Makefile`.

2) Check whether there is any data race in your code. (More details on data races are given in the Links and Resources section below.)
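To make the first hint concrete, here is a minimal sketch of a loop whose data movement is spelled out with explicit clauses instead of being left for the compiler to choose. The function name `copy_back_sketch`, the row-major layout with a one-cell halo, and the parameter names are assumptions made for illustration; this is not the tutorial's solution. When the clauses are omitted, the `-Minfo=accel` report labels the transfers the compiler picked as "implicit".

```cpp
#include <cassert>

// Hypothetical helper (illustration only): copy the interior of a scratch
// grid back into the main grid. Both arrays are assumed to be row-major with
// a one-cell halo, so each holds (rows + 2) * (cols + 2) values.
void copy_back_sketch(double* psi, const double* psitmp, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    const int stride = cols + 2;
    const int len = (rows + 2) * stride;

    // Explicit data clauses:
    //   copyin(psitmp) - the scratch grid is only read on the device.
    //   copy(psi)      - only the interior is written, so psi is copied in
    //                    and out to keep its halo values intact.
    // Every (i, j) iteration writes a distinct element, so there is no race.
    #pragma acc parallel loop collapse(2) copyin(psitmp[0:len]) copy(psi[0:len])
    for (int i = 1; i <= rows; ++i) {
        for (int j = 1; j <= cols; ++j) {
            psi[i * stride + j] = psitmp[i * stride + j];
        }
    }
}
```

A compiler without OpenACC support ignores the pragma and runs the loop serially, which is a convenient way to check the logic before building with `-acc`.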

## Compile and run OpenACC enabled code


In [118]:
!cd ../source_code/openacc && make clean && make

rm -f arraymalloc.o boundary.o cfd.o cfdio.o jacobi.o cfd velocity.dat colourmap.dat cfd.plt core
nvc++ -acc -ta=tesla:managed,lineinfo -Minfo=accel -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c arraymalloc.cpp
nvc++ -acc -ta=tesla:managed,lineinfo -Minfo=accel -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c boundary.cpp
nvc++ -acc -ta=tesla:managed,lineinfo -Minfo=accel -lm -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt -c cfd.cpp
main:
    220, Generating Tesla code
        220, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
        222,   /* blockIdx.x threadIdx.x collapsed */
    220, Generating implicit copyin(psitmp[:]) [if not already present]
         Generating implicit copyout(psi[:]) [if not alrea

Hint: Add `-Minfo=accel` to the `Makefile` to check that kernel code has indeed been generated.

## Profile the OpenACC Code

In [119]:
!cd ../source_code/openacc && nsys profile -t nvtx,openacc,cuda --stats=true --force-overwrite true -o minicfdopenacc_profile ./cfd 64 500

Collecting data...
Scale Factor = 64, iterations = 500
Irrotational flow
Running CFD on 2048 x 2048 grid in serial

Starting main loop...


... finished
After 500 iterations, the error is 0.00211211
Time for 500 iterations was 0.210083 seconds
Each iteration took 0.000420166 seconds


Writing data files ...
... done!

Written gnuplot script 'cfd.plt'
... finished
Processing events...
Saving temporary "/tmp/nsys-report-6f02-6ab2-c5b7-6da5.qdstrm" file to disk...

Creating final output files...
Saved report file to "/tmp/nsys-report-6f02-6ab2-c5b7-6da5.qdrep"

Exported successfully to
/tmp/nsys-report-6f02-6ab2-c5b7-6da5.sqlite


CUDA API Statistics:

 Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum           Name        
 -------  ---------------  ---------  ----------  --------  --------  --------------------
    80.6        178585662       1001    178407.3      9730  11975135  cuCtxSynchronize    
     9.1         20240031          1  20240031.0  20240031  20240031 

In [120]:
# multicore version
!cd ../source_code/openacc && nsys profile -t nvtx,openacc,cuda --stats=true --force-overwrite true -o minicfdopenacc_profile ./cfd 64 500

Collecting data...
Scale Factor = 64, iterations = 500
Irrotational flow
Running CFD on 2048 x 2048 grid in serial

Starting main loop...


... finished
After 500 iterations, the error is 0.00211211
Time for 500 iterations was 0.239742 seconds
Each iteration took 0.000479484 seconds


Writing data files ...
... done!

Written gnuplot script 'cfd.plt'
... finished
Processing events...
Saving temporary "/tmp/nsys-report-910d-e06b-f0ee-5701.qdstrm" file to disk...

Creating final output files...
Saved report file to "/tmp/nsys-report-910d-e06b-f0ee-5701.qdrep"

Exported successfully to
/tmp/nsys-report-910d-e06b-f0ee-5701.sqlite


CUDA API Statistics:

 Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum           Name        
 -------  ---------------  ---------  ----------  --------  --------  --------------------
    81.7        204862322       1001    204657.7      3528  11773584  cuCtxSynchronize    
     8.3         20909080          1  20909080.0  20909080  20909080 

You can examine the output in the terminal, or you can download the report file and view the timeline by opening it with NVIDIA Nsight Systems.

Download and save the profiler report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../source_code/openacc/minicfdopenacc_profile.qdrep).

## Validating the Output

Make sure the error value printed by the OpenACC version matches that of the serial code.
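If you want to compare the two error values programmatically rather than by eye, a relative tolerance is usually safer than exact equality, because parallel execution can change the order of floating-point additions. The helper below is a hypothetical sketch; the name `errors_match_sketch` and the default tolerance are assumptions, not part of the tutorial code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (illustration only): returns true when two error
// values agree to within a relative tolerance.
bool errors_match_sketch(double serial_error, double accel_error,
                         double rel_tol = 1e-6)
{
    assert(rel_tol >= 0.0);
    // Scale the tolerance by the larger magnitude so the check behaves the
    // same for small and large error values.
    const double scale = std::max(std::fabs(serial_error), std::fabs(accel_error));
    return std::fabs(serial_error - accel_error) <= rel_tol * scale;
}
```

For the runs shown above, both versions print `0.00211211`, so `errors_match_sketch(0.00211211, 0.00211211)` returns `true`.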

# Recommendations for adding OpenACC Pragmas

After finding the hotspot function, take an incremental approach to adding pragmas.

1) Ignore the initialization, finalization and I/O functions

2) Take an incremental approach by adding pragmas one at a time

3) Unified Memory provides a good starting point, since you need not worry about data transfers (`-ta=tesla:managed`)

4) Cross-check the output after each incremental change to check correctness and algorithmic scalability

5) Move on to using data clauses for better performance 

6) Start with a small problem size that reduces the execution time. 
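As a rough illustration of recommendation 5, the sketch below wraps the whole iteration loop in a single OpenACC `data` region so the grids are transferred once, not around every kernel. The function name `run_iterations_sketch`, the grid layout (row-major with a one-cell halo), and the four-point Jacobi average are assumptions made for this example; the real code in `cfd.cpp` and `jacobi.cpp` may be organised differently.

```cpp
#include <cassert>

// Hypothetical driver (illustration only): runs a fixed number of Jacobi
// iterations while keeping both grids resident on the device.
void run_iterations_sketch(double* psi, double* psitmp,
                           int rows, int cols, int iterations)
{
    assert(rows > 0 && cols > 0 && iterations >= 0);
    const int stride = cols + 2;
    const int len = (rows + 2) * stride;

    // psi is copied to the device once and back once; psitmp is scratch
    // space that only ever needs to exist on the device.
    #pragma acc data copy(psi[0:len]) create(psitmp[0:len])
    {
        for (int iter = 0; iter < iterations; ++iter) {
            // Step 1: average the four neighbours into the scratch grid.
            #pragma acc parallel loop collapse(2) present(psi[0:len], psitmp[0:len])
            for (int i = 1; i <= rows; ++i) {
                for (int j = 1; j <= cols; ++j) {
                    psitmp[i * stride + j] = 0.25 * (psi[(i - 1) * stride + j] +
                                                     psi[(i + 1) * stride + j] +
                                                     psi[i * stride + j - 1] +
                                                     psi[i * stride + j + 1]);
                }
            }

            // Step 2: copy the interior back for the next iteration.
            #pragma acc parallel loop collapse(2) present(psi[0:len], psitmp[0:len])
            for (int i = 1; i <= rows; ++i) {
                for (int j = 1; j <= cols; ++j) {
                    psi[i * stride + j] = psitmp[i * stride + j];
                }
            }
        }
    }
}
```

Starting with managed memory and moving to a `data` region only once the kernels are validated keeps each change small, in line with the incremental approach above.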


**General tip:** Be aware of *data races*: situations in which at least two threads access a shared variable at the same time and at least one of them tries to modify it. If a data race occurs, the result may be incorrect. So, make sure to validate your output against the serial version.
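A common place for a data race in this kind of code is an accumulation into a shared scalar, such as a sum of squared differences used as an error measure. The sketch below shows one way to make such a loop safe with a `reduction` clause. The function name `delta_squared_sketch` and its parameters are hypothetical, chosen for illustration only.

```cpp
#include <cassert>

// Hypothetical helper (illustration only): sum of squared differences
// between two arrays of the same length.
double delta_squared_sketch(const double* newarr, const double* oldarr, int len)
{
    assert(len >= 0);
    double sum = 0.0;

    // Every iteration adds into `sum`. If the loop ran in parallel without
    // the reduction clause, threads would update the shared variable at the
    // same time. reduction(+:sum) gives each thread a private partial sum
    // and combines the partial sums when the loop finishes.
    #pragma acc parallel loop reduction(+:sum) copyin(newarr[0:len], oldarr[0:len])
    for (int k = 0; k < len; ++k) {
        const double d = newarr[k] - oldarr[k];
        sum += d * d;
    }
    return sum;
}
```

Because the partial sums may be combined in a different order than in the serial loop, the last few digits of the result can differ slightly between runs or between the serial and accelerated versions.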

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 



## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).