# CUDA C Acceleration 
Let's start by displaying information about the GPUs on the server with the `nvidia-smi` command, which ships with the NVIDIA GPU drivers that we will be using. To run it, give the cell below focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!nvidia-smi

## Copy and Compile the Serial code

Before you start modifying the serial code, let's copy it into the `cuda-c` directory so that you work on the copy.

In [None]:
!cp ../source_code/serial/* ../source_code/cuda-c

In [None]:
!cd ../source_code/cuda-c && make clean && make

## Run the Serial code

In [None]:
!cd ../source_code/cuda-c && ./cfd 64 500

---

# Start adding CUDA C constructs

Now, you can start modifying the C++ code and the `Makefile`:

[cfd code](../source_code/cuda-c/cfd.cpp) 

[Makefile](../source_code/cuda-c/Makefile)

Remember to **SAVE** your code after making changes and before running the cells below.

#### Some Hints
Check whether there is any data race in your code. (More details on data races are given in the general tip further below.)

## Compile and run CUDA C enabled code


In [None]:
!cd ../source_code/cuda-c && make clean && make

## Profile the CUDA C Code

In [None]:
!cd ../source_code/cuda-c && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o minicfdcudac_profile ./cfd 64 500

You can examine the output in the terminal, or you can download the report file and view the timeline by opening it in NVIDIA Nsight Systems.

Download and save the profiler report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../source_code/cuda-c/minicfdcudac_profile.qdrep).
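The `-t nvtx,cuda` option in the profile command above asks Nsight Systems to trace NVTX ranges as well as CUDA calls. NVTX ranges only show up on the timeline if the code is annotated with them. The following is a minimal sketch of such an annotation, assuming the header-only NVTX v3 that ships with recent CUDA Toolkits; `fake_solver_step` and the range names are hypothetical stand-ins, not functions from the lab code.

```cuda
#include <cstdio>
#include <nvtx3/nvToolsExt.h>  // header-only NVTX v3 (assumed available in the toolkit)

// Hypothetical stand-in for a hotspot function in the application.
double fake_solver_step(int n) {
    double sum = 0.0;
    for (int i = 1; i <= n; ++i) {
        sum += 1.0 / i;
    }
    return sum;
}

int main() {
    nvtxRangePushA("solver_loop");      // open a named range around the whole loop
    double total = 0.0;
    for (int iter = 0; iter < 100; ++iter) {
        nvtxRangePushA("solver_step");  // nested range around a single step
        total += fake_solver_step(1000);
        nvtxRangePop();                 // close "solver_step"
    }
    nvtxRangePop();                     // close "solver_loop"
    std::printf("total = %f\n", total);
    return 0;
}
```

Each `nvtxRangePushA` call must be matched by an `nvtxRangePop`. On Linux, linking may additionally require `-ldl`, depending on the toolchain.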

## Validating the Output

Make sure the error value printed as output matches that of the serial code.
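If the CUDA version accumulates the error in a different order than the serial code, the last few digits of a floating-point result may differ. Comparing with a relative tolerance is one common way to handle this. The sketch below is host-only code; the function name `errors_match`, the tolerance `1e-6`, and the two sample values are all assumptions for illustration and should be replaced with the values printed by your own runs.

```cuda
#include <cmath>
#include <cstdio>

// Returns true if two error values agree within a relative tolerance.
bool errors_match(double serial_error, double cuda_error, double rel_tol) {
    const double diff = std::fabs(serial_error - cuda_error);
    const double scale = std::fmax(std::fabs(serial_error), std::fabs(cuda_error));
    return diff <= rel_tol * scale;
}

int main() {
    // Placeholder values: substitute the error printed by each version.
    const double serial_error = 0.00123456;
    const double cuda_error = 0.00123456;
    const bool ok = errors_match(serial_error, cuda_error, 1e-6);
    std::printf("%s\n", ok ? "match" : "MISMATCH");
    return ok ? 0 : 1;
}
```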

# Recommendations for adding CUDA C

After finding the hotspot function, take an incremental approach to adding CUDA kernels.

1) Rename files that contain CUDA kernels so that they use the `.cu` extension.

2) Ignore the initialization, finalization and I/O functions.

3) Cross-check the output after each incremental change to verify correctness and check algorithmic scalability.

4) Start with a small problem size to keep the execution time short.
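The recommendations above can be sketched on a small example: keep the serial loop as a reference, write a kernel that does the same work with one thread per grid point, and compare the two results on a small problem. The 4-point averaging stencil, the function names, and the 16x16 block size below are hypothetical choices for illustration and are not taken from the lab's `cfd.cpp`.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Serial reference: 4-point average over the interior of an (n+2) x (n+2) grid.
void stencil_serial(const double* in, double* out, int n) {
    const int w = n + 2;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            out[i * w + j] = 0.25 * (in[(i - 1) * w + j] + in[(i + 1) * w + j] +
                                     in[i * w + j - 1] + in[i * w + j + 1]);
}

// CUDA version: each thread computes one interior point, so no two threads
// write to the same element.
__global__ void stencil_kernel(const double* in, double* out, int n) {
    const int w = n + 2;
    const int i = blockIdx.y * blockDim.y + threadIdx.y + 1;
    const int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (i <= n && j <= n)
        out[i * w + j] = 0.25 * (in[(i - 1) * w + j] + in[(i + 1) * w + j] +
                                 in[i * w + j - 1] + in[i * w + j + 1]);
}

// Runs both versions; returns the largest absolute difference, or -1 on a CUDA error.
double run_and_compare(int n) {
    const int w = n + 2;
    const size_t count = static_cast<size_t>(w) * w;
    const size_t bytes = count * sizeof(double);
    std::vector<double> in(count), ref(count, 0.0), gpu(count, 0.0);
    for (size_t k = 0; k < count; ++k) in[k] = std::sin(0.01 * k);
    stencil_serial(in.data(), ref.data(), n);

    double *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0; }
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);  // boundary cells stay zero, as in the reference

    const dim3 block(16, 16);
    const dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    stencil_kernel<<<grid, block>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();  // catches launch errors
    if (err == cudaSuccess)                // blocking copy also waits for the kernel
        err = cudaMemcpy(gpu.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1.0;

    double max_diff = 0.0;
    for (size_t k = 0; k < count; ++k)
        max_diff = std::fmax(max_diff, std::fabs(ref[k] - gpu[k]));
    return max_diff;
}

int main() {
    const double max_diff = run_and_compare(64);  // small problem size first
    if (max_diff < 0.0) { std::printf("CUDA error\n"); return 1; }
    std::printf("max |serial - cuda| = %g\n", max_diff);
    return max_diff < 1e-12 ? 0 : 1;
}
```

Because the kernel lives in the same file as host code, this file would need the `.cu` extension and be compiled with `nvcc`.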


**General tip:** Be aware of *data races*: situations in which at least two threads access a shared variable at the same time and at least one of them tries to modify it. A data race can produce an incorrect result, so make sure to validate your output against the serial version.
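A minimal sketch of a data race and one way to remove it is given here, assuming a single shared counter that every thread increments. The kernel and function names are hypothetical. The racy kernel performs an unsynchronized read-modify-write, so its result is unpredictable; the second kernel uses `atomicAdd`, which makes each increment indivisible.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Racy: many threads read, modify and write *counter without synchronization.
__global__ void count_racy(int* counter, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) *counter = *counter + 1;  // data race on *counter
}

// Race-free: atomicAdd performs the read-modify-write as one indivisible step.
__global__ void count_atomic(int* counter, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) atomicAdd(counter, 1);
}

// Launches n threads that each add 1 to a counter; returns the final value,
// or -1 on a CUDA error.
int run_counter(bool use_atomic, int n) {
    int* d_counter = nullptr;
    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_counter, 0, sizeof(int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (use_atomic)
        count_atomic<<<blocks, threads>>>(d_counter, n);
    else
        count_racy<<<blocks, threads>>>(d_counter, n);

    int result = -1;
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)  // blocking copy also waits for the kernel to finish
        err = cudaMemcpy(&result, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return err == cudaSuccess ? result : -1;
}

int main() {
    const int n = 1 << 16;
    std::printf("expected: %d\n", n);
    std::printf("racy:     %d (unpredictable)\n", run_counter(false, n));
    std::printf("atomic:   %d\n", run_counter(true, n));
    return 0;
}
```

Atomics serialize access to the shared variable, so for larger reductions a per-block partial sum followed by a final combine step is often preferred; the atomic version is shown here only because it is the shortest correct fix.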

# Links and Resources

[CUDA Introduction](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 



## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).