# Linear Transformation on a GPU with CUDA

The first exercise demonstrates the most bare-bones setup for a CUDA algorithm
in Athena. It shows how to set up a reentrant algorithm that runs synchronous
CUDA commands to perform a trivial operation on a GPU.

## Build The Project

As the very first step, let's try to build the project.
  - On Perlmutter:

In [None]:
!./scripts/run_on_perlmutter.sh cmake -DCMAKE_BUILD_TYPE=Debug -S . -B build
!./scripts/run_on_perlmutter.sh cmake --build build

  - On SWAN:

In [None]:
!./scripts/run_command.sh cmake -DCMAKE_BUILD_TYPE=Debug -S . -B build
!./scripts/run_command.sh cmake --build build

If the previous command failed with the following error, you have some
additional work to do:

```text
ERROR:
ERROR: CUDA not available. Please configure the setup script!
ERROR:
```

In that case, edit the value of `CUDA_DIR` in
[scripts/build_env.sh](scripts/build_env.sh). After setting that variable
to point to a valid local CUDA installation, re-run the previous cell.

## Code Structure

This first example has a very simple layout.
  - [LinearTransformCUDAAlg.h](CUDAExamples/src/01_LinearTransform/LinearTransformCUDAAlg.h):
    Simple (reentrant) algorithm header with no members, declaring only an
    `initialize()` and an `execute(...)` function.
  - [LinearTransformCUDAAlg.cu](CUDAExamples/src/01_LinearTransform/LinearTransformCUDAAlg.cu):
    CUDA file holding both the C++ code of the algorithm and the CUDA (kernel)
    code that the algorithm runs.
     * See Example 2 for details on why this setup is not a good idea for
       production code.
  - [01_LinearTransformConfig.py](CUDAExamples/python/01_LinearTransformConfig.py):
    CA configuration for running the example algorithm in a 2-threaded MT job,
    without any input file.
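To make the single-file layout concrete, here is a minimal standalone sketch of a `.cu` file that keeps the kernel and the host code calling it side by side. It is not the tutorial's actual source: the names `linearTransformSketch` and `runLinearTransformSketch` are hypothetical, no Athena classes are used, and return codes are ignored to keep the flow short.

```cuda
#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Device code: each thread computes one element, out[i] = a * in[i] + b.
// (Hypothetical kernel name, chosen for this sketch only.)
__global__ void linearTransformSketch(std::size_t n, const float* in,
                                      float* out, float a, float b) {
  const std::size_t i =
      static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < n) {  // Threads past the end of the array do nothing.
    out[i] = a * in[i] + b;
  }
}

// Host code in the same file: allocate, copy in, launch, copy out, free.
// Return codes are ignored here for brevity; real code should check them.
std::vector<float> runLinearTransformSketch(const std::vector<float>& input,
                                            float a, float b) {
  const std::size_t n = input.size();
  const std::size_t bytes = n * sizeof(float);
  std::vector<float> output(n);
  if (n == 0) {
    return output;
  }

  float* dIn = nullptr;
  float* dOut = nullptr;
  cudaMalloc(&dIn, bytes);
  cudaMalloc(&dOut, bytes);

  // Blocking copy: returns once the data is on the device.
  cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice);

  // Enough blocks to cover all n elements, rounding up.
  const unsigned int threads = 256;
  const unsigned int blocks =
      static_cast<unsigned int>((n + threads - 1) / threads);
  linearTransformSketch<<<blocks, threads>>>(n, dIn, dOut, a, b);

  // Blocking copy: waits for the kernel to finish before copying back.
  cudaMemcpy(output.data(), dOut, bytes, cudaMemcpyDeviceToHost);

  cudaFree(dIn);
  cudaFree(dOut);
  return output;
}
```

Because both `cudaMemcpy` calls block the calling thread, no explicit stream or synchronization call is needed in this sketch, which is what "synchronous" means in this exercise.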

## Running The Code As Is

Try to execute the code in its present form!
  - On Perlmutter:

In [None]:
!./scripts/run_on_perlmutter.sh ./build/CMakeFiles/atlas_build_run.sh athena.py --threads=2 --CA CUDAExamples/01_LinearTransformConfig.py

  - On SWAN:

In [None]:
!./scripts/run_command.sh ./build/CMakeFiles/atlas_build_run.sh athena.py --threads=2 --CA CUDAExamples/01_LinearTransformConfig.py

The job should start successfully, but then stop in the first event with an
error complaining about an illegal memory access.
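Kernel launches are asynchronous with respect to the host, so an illegal memory access inside a kernel is usually reported by a later call that waits for the kernel, such as `cudaDeviceSynchronize` or a blocking `cudaMemcpy`, rather than by the launch itself. A minimal sketch of checking both stages follows; the helper `checkCuda`, the kernel `noopKernel`, and the function `launchChecked` are hypothetical names, not part of the tutorial code.

```cuda
#include <cuda_runtime.h>

#include <stdexcept>
#include <string>

// Hypothetical helper: turn a failed CUDA runtime call into an exception
// that carries the runtime's own error description.
inline void checkCuda(cudaError_t status, const char* what) {
  if (status != cudaSuccess) {
    throw std::runtime_error(std::string(what) + ": " +
                             cudaGetErrorString(status));
  }
}

// Trivial kernel, present only so the checks have something to wrap.
__global__ void noopKernel() {}

void launchChecked() {
  noopKernel<<<1, 1>>>();
  // Reports problems with the launch itself (e.g. an invalid configuration).
  checkCuda(cudaGetLastError(), "kernel launch");
  // Waits for the kernel; errors raised while it was running surface here.
  checkCuda(cudaDeviceSynchronize(), "kernel execution");
}
```

In an Athena algorithm one would more likely convert the failure into an error message and a failed `StatusCode` than throw; the exception is used here only to keep the sketch standalone.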

## Debugging The Athena Job

If you look carefully at
[LinearTransformCUDAAlg.cu](CUDAExamples/src/01_LinearTransform/LinearTransformCUDAAlg.cu),
you can probably figure out what the problem is without any additional help.
But let's use
[cuda-gdb](https://docs.nvidia.com/cuda/cuda-gdb/index.html) instead to
understand where the error is coming from.

Debugging is an interactive task, and you should try to set up an interactive
debugging session based on the following command to inspect the code. To make
it a little easier to do everything directly from the notebook, however, the
following command runs a batch debug job (with the help of
[01_LinearTransformCrashBacktrace.txt](CUDAExamples/debug/01_LinearTransformCrashBacktrace.txt)),
printing the output needed to understand what is going wrong.

  - On Perlmutter:

In [None]:
!./scripts/run_on_perlmutter.sh ./build/CMakeFiles/atlas_build_run.sh ./scripts/cuda-gdb.sh --batch --command=./CUDAExamples/debug/01_LinearTransformCrashBacktrace.txt --args python ./CUDAExamples/python/01_LinearTransformConfig.py

  - On SWAN:

In [None]:
!./scripts/run_command.sh ./build/CMakeFiles/atlas_build_run.sh cuda-gdb --batch --command=./CUDAExamples/debug/01_LinearTransformCrashBacktrace.txt --args python ./CUDAExamples/python/01_LinearTransformConfig.py

Besides `cuda-gdb`, another very good tool for finding problems in CUDA code is
[compute-sanitizer](https://docs.nvidia.com/compute-sanitizer/index.html).
Use the following command to inspect the example job.
  - On Perlmutter:

In [None]:
!./scripts/run_on_perlmutter.sh ./build/CMakeFiles/atlas_build_run.sh compute-sanitizer --leak-check=full athena.py --CA CUDAExamples/01_LinearTransformConfig.py

  - On SWAN:

In [None]:
!./scripts/run_command.sh ./build/CMakeFiles/atlas_build_run.sh compute-sanitizer --leak-check=full athena.py --CA CUDAExamples/01_LinearTransformConfig.py

## Tasks

  - Update `GPUTutorial::Kernels::linearTransform` to work correctly with
    arbitrary grid sizes.
  - Fix the memory leak issue(s) in
    `GPUTutorial::LinearTransformCUDAAlg::execute`.
  - Try to update the code such that memory is freed even in case of
    runtime errors.
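Two common techniques are relevant to these tasks: a grid-stride loop, which keeps a kernel correct for any launch configuration, and an owning smart pointer, which frees device memory on every exit path. The following is a hedged sketch of both, not the reference solution; `linearTransformStrided`, `DeviceDeleter`, `DeviceArray`, `makeDeviceArray` and `transformOnDevice` are hypothetical names.

```cuda
#include <cuda_runtime.h>

#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Grid-stride loop: thread t handles elements t, t + stride, t + 2 * stride,
// ... so every element is covered no matter how many threads were launched.
__global__ void linearTransformStrided(std::size_t n, const float* in,
                                       float* out, float a, float b) {
  const std::size_t stride =
      static_cast<std::size_t>(gridDim.x) * blockDim.x;
  for (std::size_t i =
           static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    out[i] = a * in[i] + b;
  }
}

// Deleter that returns device memory to the CUDA runtime.
struct DeviceDeleter {
  void operator()(void* ptr) const { cudaFree(ptr); }
};
using DeviceArray = std::unique_ptr<float[], DeviceDeleter>;

// Allocate n floats on the device, owned by a smart pointer.
DeviceArray makeDeviceArray(std::size_t n) {
  float* raw = nullptr;
  if (cudaMalloc(&raw, n * sizeof(float)) != cudaSuccess) {
    throw std::runtime_error("cudaMalloc failed");
  }
  return DeviceArray(raw);
}

std::vector<float> transformOnDevice(const std::vector<float>& input, float a,
                                     float b, unsigned int blocks,
                                     unsigned int threads) {
  const std::size_t n = input.size();
  const std::size_t bytes = n * sizeof(float);
  std::vector<float> output(n);
  if (n == 0) {
    return output;
  }

  // Both buffers are released automatically when they go out of scope,
  // including when one of the exceptions below is thrown.
  DeviceArray dIn = makeDeviceArray(n);
  DeviceArray dOut = makeDeviceArray(n);

  if (cudaMemcpy(dIn.get(), input.data(), bytes, cudaMemcpyHostToDevice) !=
      cudaSuccess) {
    throw std::runtime_error("host-to-device copy failed");
  }
  linearTransformStrided<<<blocks, threads>>>(n, dIn.get(), dOut.get(), a, b);
  if (cudaMemcpy(output.data(), dOut.get(), bytes, cudaMemcpyDeviceToHost) !=
      cudaSuccess) {
    throw std::runtime_error("kernel or device-to-host copy failed");
  }
  return output;
}
```

The same ownership pattern also covers early `return` statements, which matters in an `execute(...)` function that reports failures through its return value instead of exceptions.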

After you have updated the code, run the following cell to re-build it.
  - On Perlmutter:

In [None]:
!./scripts/run_on_perlmutter.sh cmake --build build

  - On SWAN:

In [None]:
!./scripts/run_command.sh cmake --build build

The goal is to have the example job finish successfully, with output like:

```text
...
GPUTutorial::LinearTransformCUDAAlg                   97     1    INFO outputHost[1000]   = 2001
GPUTutorial::LinearTransformCUDAAlg                   97     1    INFO outputHost[999999] = 2e+06
AthenaHiveEventLoopMgr                                98     0    INFO   ===>>>  done processing event #99, run #1 on slot 0,  98 events processed so far  <<<===
AthenaHiveEventLoopMgr                                99     0    INFO   ===>>>  start processing event #100, run #1 on slot 0,  98 events processed so far  <<<===
AthenaHiveEventLoopMgr                                97     1    INFO   ===>>>  done processing event #98, run #1 on slot 1,  99 events processed so far  <<<===
GPUTutorial::LinearTransformCUDAAlg                   99     0    INFO outputHost[0]      = 1
GPUTutorial::LinearTransformCUDAAlg                   99     0    INFO outputHost[1000]   = 2001
GPUTutorial::LinearTransformCUDAAlg                   99     0    INFO outputHost[999999] = 2e+06
AthenaHiveEventLoopMgr                                99     0    INFO   ===>>>  done processing event #100, run #1 on slot 0,  100 events processed so far  <<<===
AthenaHiveEventLoopMgr                                99     0    INFO ---> Loop Finished (seconds): 5.62849
ApplicationMgr                                                    INFO Application Manager Stopped successfully
SGInputLoader                                                     INFO Finalizing SGInputLoader...
AvalancheSchedulerSvc                                             INFO Joining Scheduler thread
FPEAuditor                                                        INFO FPE summary for this job
FPEAuditor                                                        INFO  FPE OVERFLOWs  : 0
FPEAuditor                                                        INFO  FPE INVALIDs   : 0
FPEAuditor                                                        INFO  FPE DIVBYZEROs : 0
EventDataSvc                                                      INFO Finalizing EventDataSvc
ApplicationMgr                                                    INFO Application Manager Finalized successfully
ApplicationMgr                                                    INFO Application Manager Terminated successfully
```

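Re-run the job with the following command to check its output. Also re-run the
`compute-sanitizer` command from the debugging section above, and check that it
no longer reports any memory leaks!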
  - On Perlmutter:

In [None]:
!./scripts/run_on_perlmutter.sh ./build/CMakeFiles/atlas_build_run.sh athena.py --threads=2 --CA CUDAExamples/01_LinearTransformConfig.py

  - On SWAN:

In [None]:
!./scripts/run_command.sh ./build/CMakeFiles/atlas_build_run.sh athena.py --threads=2 --CA CUDAExamples/01_LinearTransformConfig.py