# CUDA Fortran Acceleration 
Before we begin, run the cell below to display information about the NVIDIA® CUDA® driver and the GPUs on the server using the `nvidia-smi` command. To execute the cell, click on it and press Ctrl+Enter, or press the play button in the toolbar above. You should see some output below the grey cell.

In [None]:
!nvidia-smi

## Copy and Compile the Serial code

Before you start modifying the serial code, make a copy of it in the `cudafortran` directory.

In [None]:
!cp ../source_code/serial/* ../source_code/cudafortran

In [None]:
!cd ../source_code/cudafortran && make clean && make

## Run the Serial code

In [None]:
!cd ../source_code/cudafortran && ./cfd 64 500

---

# Start adding CUDA Fortran constructs

Now, you can start modifying the Fortran code and the `Makefile`:

[cfd code](../source_code/cudafortran/cfd.f90) 

[Makefile](../source_code/cudafortran/Makefile)

Remember to **SAVE** your code after making changes and before running the cells below.

#### Some Hints
Check your code for data races. (More details on data races are given in the general tip under the Recommendations section below.)
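
As an illustration of this hint, the sketch below shows one common way to keep a stencil update free of data races: each thread reads neighbours only from an input array and writes only its own element of a separate output array. The names `jacobi_sketch`, `jacobi_step`, `psi`, `psinew`, `m`, and `n` are hypothetical and are not taken from `cfd.f90`, and the 16x16 thread block is only an assumed starting point.

```fortran
module jacobi_sketch
  use cudafor
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical kernel: one thread updates one interior grid point.
  ! Reads come only from psi and writes go only to psinew, so no thread
  ! ever reads a value that another thread is modifying.
  attributes(global) subroutine jacobi_step(psinew, psi, m, n)
    integer, value :: m, n
    real(dp), intent(inout) :: psinew(0:m+1, 0:n+1)
    real(dp), intent(in)    :: psi(0:m+1, 0:n+1)
    integer :: i, j

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    j = (blockIdx%y - 1) * blockDim%y + threadIdx%y

    ! Guard against threads that fall outside the interior 1..m, 1..n
    if (i <= m .and. j <= n) then
      psinew(i, j) = 0.25_dp * (psi(i-1, j) + psi(i+1, j) &
                              + psi(i, j-1) + psi(i, j+1))
    end if
  end subroutine jacobi_step
end module jacobi_sketch

program jacobi_demo
  use cudafor
  use jacobi_sketch
  implicit none
  integer, parameter :: m = 64, n = 64      ! assumed grid size
  real(dp), allocatable         :: psi(:,:)
  real(dp), device, allocatable :: psi_d(:,:), psinew_d(:,:)
  type(dim3) :: grid, tblock
  integer :: istat

  allocate(psi(0:m+1, 0:n+1))
  allocate(psi_d(0:m+1, 0:n+1), psinew_d(0:m+1, 0:n+1))

  psi = 1.0_dp          ! uniform field: one sweep should leave it unchanged
  psi_d = psi           ! host-to-device copies
  psinew_d = psi

  tblock = dim3(16, 16, 1)
  grid = dim3((m + tblock%x - 1) / tblock%x, (n + tblock%y - 1) / tblock%y, 1)

  call jacobi_step<<<grid, tblock>>>(psinew_d, psi_d, m, n)
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, "CUDA error: ", cudaGetErrorString(istat)

  psi = psinew_d        ! device-to-host copy
  print *, "max deviation from 1.0:", maxval(abs(psi - 1.0_dp))
end program jacobi_demo
```

Updating a single array in place would instead let one thread read a neighbour that another thread has already overwritten. This sketch assumes the NVIDIA HPC SDK compiler, for example `nvfortran -cuda jacobi_demo.f90`.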

## Compile and run CUDA Fortran enabled code


In [None]:
!cd ../source_code/cudafortran && make clean && make

## Profile the CUDA Fortran Code

In [None]:
!cd ../source_code/cudafortran && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o minicfdcudafortran_profile ./cfd 64 500

You can examine the output in the terminal, or download the report file and view the timeline in NVIDIA Nsight Systems.

Download and save the report file by holding down <mark>Shift</mark> and <mark>right-clicking</mark> [here](../source_code/cudafortran/minicfdcudafortran_profile.nsys-rep), then choosing <mark>Save Link As</mark>. Once downloaded, open it in the Nsight Systems GUI.

## Validating the Output

Make sure the error value printed by the CUDA Fortran version matches that of the serial code.

# Recommendations for adding CUDA Fortran

After finding the hotspot function, take an incremental approach to adding CUDA Fortran constructs:

1) Ignore the initialization, finalization and I/O functions.

2) Cross-check the output after each incremental change to confirm that the results are still correct.

3) Start with a small problem size to keep the execution time short.
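
The cross-checking step can be sketched as a small helper that compares a reference array with the result of the ported routine. This is a minimal illustration in standard Fortran, not code from `cfd.f90`: the names `cross_check`, `max_rel_diff`, `ref`, and `test`, the stand-in data, and the tolerance `1.0e-12` are all assumptions.

```fortran
program cross_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 8               ! deliberately small problem size
  real(dp), parameter :: tol = 1.0e-12_dp   ! assumed tolerance
  real(dp) :: ref(n), test(n)
  integer :: i

  ! Stand-in data: in practice ref would come from the serial routine
  ! and test from the routine that was just ported.
  do i = 1, n
    ref(i) = real(i, dp)
  end do
  test = ref

  if (max_rel_diff(ref, test) > tol) then
    error stop "ported result differs from the serial reference"
  end if
  print *, "results match within tolerance"

contains

  ! Largest element-wise relative difference between two arrays.
  pure function max_rel_diff(a, b) result(d)
    real(dp), intent(in) :: a(:), b(:)
    real(dp) :: d
    ! tiny() avoids division by zero where the reference is exactly 0
    d = maxval(abs(a - b) / max(abs(a), tiny(1.0_dp)))
  end function max_rel_diff

end program cross_check
```

Comparing against a tolerance rather than testing for exact equality allows for small floating-point differences, which can arise when operations are reordered on the GPU.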


**General tip:** Be aware of *data races*: situations in which at least two threads access a shared variable at the same time and at least one of them tries to modify it. If a data race occurs, the program may return an incorrect result, so make sure to validate your output against the serial version.
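
A typical place where such a race can appear in an iterative solver is the accumulation of a sum, such as an error norm, into a single scalar that every thread updates. One way to avoid it, sketched below, is a CUF kernel loop directive, for which the compiler generates the reduction on the scalar. The names `cuf_reduction`, `a_d`, and `total` are hypothetical and are not taken from `cfd.f90`.

```fortran
program cuf_reduction
  use cudafor
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024            ! assumed array length
  real(dp), device, allocatable :: a_d(:)
  real(dp) :: total
  integer :: i

  allocate(a_d(n))
  a_d = 2.0_dp          ! fill the device array from the host

  ! The loop body runs on the GPU. total is a host scalar used in a
  ! sum reduction, so the compiler produces one final result instead of
  ! letting threads overwrite each other's partial sums.
  total = 0.0_dp
  !$cuf kernel do <<<*, *>>>
  do i = 1, n
    total = total + a_d(i) * a_d(i)
  end do

  ! Each term is 4, so the sum should equal 4*n
  print *, "sum of squares:", total, " expected:", 4.0_dp * n

  deallocate(a_d)
end program cuf_reduction
```

In a hand-written kernel, the same accumulation would need an atomic operation or an explicit reduction to be correct.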

# Links and Resources

[CUDA Introduction](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 



## Licensing 

Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.