# Linear Transformation on a GPU with SYCL

This exercise does not go beyond what was already shown in the very first
exercise. It just shows the same kind of setup, implemented with
[oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) /
[SYCL](https://www.khronos.org/sycl/).

Note that this exercise was only set up for/on SWAN, as I could not find a
way to target GPUs on Perlmutter with the oneAPI installations available there.

## Build the project

The build happens in the same way as in all the previous exercises.

In [None]:
!./scripts/run_on_swan.sh cmake -DCMAKE_BUILD_TYPE=Debug -S . -B build
!./scripts/run_on_swan.sh cmake --build build

## Code Structure

This last example is set up in the same way as the first one.
  - [LinearTransformSYCLAlg.h](SYCLExamples/src/04_LinearTransform/LinearTransformSYCLAlg.h):
    Simple (reentrant) algorithm header with no members, and an `initialize()`
    and `execute(...)` function.
  - [LinearTransformSYCLAlg.sycl](SYCLExamples/src/04_LinearTransform/LinearTransformSYCLAlg.sycl):
    SYCL file holding all of the C++ code of the algorithm, including the
    "in-line" kernel executed by the algorithm.
  - [04_LinearTransformConfig.py](SYCLExamples/python/04_LinearTransformConfig.py):
    CA configuration for running the example algorithm in a 2-threaded MT job,
    without any input file.
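
To make the idea of an "in-line" kernel more concrete, here is a minimal host-only sketch in plain C++. It deliberately uses no SYCL headers, so it runs with any C++ compiler. The names `hostParallelFor`, `LinearTransform`, `transformWithLambda` and `transformWithFunctor`, as well as the operation `y = a * x + b`, are assumptions made for illustration and are not taken from `LinearTransformSYCLAlg.sycl`. `hostParallelFor` only stands in for a one-dimensional `parallel_for` by calling the kernel once per index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a 1D parallel_for: calls the kernel
// once for every index in [0, n). A SYCL runtime would instead distribute
// these calls over the threads of the device.
template <typename Kernel>
void hostParallelFor(std::size_t n, Kernel kernel) {
  for (std::size_t i = 0; i < n; ++i) {
    kernel(i);
  }
}

// Functor form of the kernel: what a lambda would capture is stored in
// explicit data members instead.
struct LinearTransform {
  float a;
  float b;
  const float* in;
  float* out;
  void operator()(std::size_t i) const { out[i] = a * in[i] + b; }
};

// Lambda form: the kernel body is written "in-line", at the call site.
std::vector<float> transformWithLambda(const std::vector<float>& in,
                                       float a, float b) {
  std::vector<float> out(in.size());
  const float* inPtr = in.data();
  float* outPtr = out.data();
  hostParallelFor(in.size(),
                  [=](std::size_t i) { outPtr[i] = a * inPtr[i] + b; });
  return out;
}

// Functor form: same operation, with the kernel defined as a named type.
std::vector<float> transformWithFunctor(const std::vector<float>& in,
                                        float a, float b) {
  std::vector<float> out(in.size());
  hostParallelFor(in.size(), LinearTransform{a, b, in.data(), out.data()});
  return out;
}
```

Both functions produce the same output for the same input. In the SYCL version of such code, the kernel would take a `sycl::id` argument instead of a `std::size_t`, and the pointers would have to refer to memory that the device can access.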

## Running The Code

Unlike the previous exercises, this one works "okay" out of the box. The
purpose here is more to show you how SYCL code is structured at its simplest
level, and that it can be used to run parallel code on practically any
type of GPU, including NVIDIA ones.

To run the job on SWAN, execute the following:

In [None]:
!./scripts/run_on_swan.sh ./build/CMakeFiles/atlas_build_run.sh athena.py --threads=2 --CA SYCLExamples/04_LinearTransformConfig.py

## Tasks

  - To show just the most basic of things, experiment with different ways
    in which you could define the kernel function. The example uses a lambda
    for the kernel, but basically any type of callable can be used there,
    including simple standalone functions, functors, etc.
  - The example uses the simplest form of executing a parallel-for with SYCL.
    In most cases, however, we would rather control the execution on a GPU in
    the same way that you learned with CUDA: by specifying the size of the
    thread groups and the number of thread groups to start.
    * This is achieved by using
      [sycl::nd_range](https://github.khronos.org/SYCL_Reference/iface/nd_range.html)
      instead of the
      [sycl::range](https://github.khronos.org/SYCL_Reference/iface/range.html)
      that the example currently uses.
    * Update the code to:
      - Provide an appropriate `sycl::nd_range` object to the `parallel_for`
        call;
      - Update the kernel lambda to take a
        [sycl::nd_item](https://github.khronos.org/SYCL_Reference/iface/nd_item.html)
        parameter instead of the current
        [sycl::id](https://github.khronos.org/SYCL_Reference/iface/id.html)
        object;
      - Experiment with the extra information provided by `sycl::nd_item` over
        `sycl::id`.
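
When moving to `sycl::nd_range`, keep in mind that the global size has to be evenly divisible by the work-group size, so the number of elements usually needs to be padded and the kernel needs a bounds check. The index arithmetic behind this can be sketched in plain, host-only C++. The helper names `paddedGlobalSize`, `ItemIndices`, `decompose` and `inBounds` are hypothetical and only illustrate the relationship between the indices that a one-dimensional `sycl::nd_item` reports; they are not part of the SYCL API or of the example code.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of localSize, so that the padded global
// size is evenly divisible by the work-group size.
std::size_t paddedGlobalSize(std::size_t n, std::size_t localSize) {
  assert(localSize > 0);
  return ((n + localSize - 1) / localSize) * localSize;
}

// Index pieces analogous to what a 1D sycl::nd_item provides.
struct ItemIndices {
  std::size_t global;  // analogous to get_global_id(0)
  std::size_t group;   // analogous to get_group(0)
  std::size_t local;   // analogous to get_local_id(0)
};

// Split a global index into its work-group index and its index inside
// that work-group: global == group * localSize + local.
ItemIndices decompose(std::size_t globalId, std::size_t localSize) {
  assert(localSize > 0);
  return {globalId, globalId / localSize, globalId % localSize};
}

// Because of the padding, some work-items have no element to process.
// A kernel should return early when this check fails.
bool inBounds(const ItemIndices& item, std::size_t n) {
  return item.global < n;
}
```

For 1000 elements and a work-group size of 256, `paddedGlobalSize(1000, 256)` gives 1024, which corresponds to 4 work-groups. The last 24 work-items (global indices 1000 to 1023) fail the `inBounds` check and should do nothing. This mirrors the grid/block size calculation known from CUDA.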