# ISO3DFD and Implementation using SYCL offloading to a GPU

## Learning Objectives

<ul>
    <li>Understand how to offload the most profitable loops in your code to the GPU using SYCL</li>
    <li>Map arrays to the device and define how you are going to access your data</li>
    <li>Offload the loops to dispatch the work on the selected device</li>
</ul>

## ISO3DFD offloading to GPU

In the previous activity, we used Intel® Offload Advisor to decide which sections of code would be good candidates for offloading to the Gen9 GPU. Advisor recommended focusing on one of the most profitable loops in the serial CPU version of the code.

Our goal now is to make sure that this loop is correctly offloaded to a GPU.

Based on the Advisor output, we can see the estimated speed-up from offloading the loops listed in the Top Offloaded section. Using SYCL, we'll offload that function to run as a kernel on the system's GPU.

## Offloading ISO3DFD application to a GPU
The 2_GPU_basic_offload version of the sample implements a basic offload of the iso3dfd function to an available GPU on the system.
* We create a queue in main and pass it to the iso3dfd function, as shown below.

```
queue q(default_selector_v, {property::queue::in_order()});
```
* Instead of iterating over all the cells in memory on the host, we create buffers and accessors so that the data is moved to the GPU when needed.

```
// Create 3D SYCL range for kernels which does not include HALO
range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,
                      n3 - 2 * kHalfLength);
// Create 3D SYCL range for buffers which includes HALO
range<3> buffer_range(n1, n2, n3);
// Create buffers using SYCL class buffer
buffer next_buf(next, buffer_range);
buffer prev_buf(prev, buffer_range);
buffer vel_buf(vel, buffer_range);
buffer coeff_buf(coeff, range(kHalfLength + 1));
```
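To make the relationship between the two ranges concrete, here is a minimal plain C++ sketch (no SYCL needed) of the arithmetic behind them. The helper names `InteriorExtents` and `LinearIndex` are hypothetical and not part of the sample, and `kHalfLength = 8` is an assumption inferred from the 9-entry coefficient array; the real constant comes from Utils.hpp.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumption: the halo width is 8 cells, inferred from the 9 coefficients
// (kHalfLength + 1) used in the sample. The real value lives in Utils.hpp.
constexpr std::size_t kHalfLength = 8;

// Extents of the interior (kernel) range for a padded n1 x n2 x n3 grid:
// the halo is removed on both sides of every dimension.
std::array<std::size_t, 3> InteriorExtents(std::size_t n1, std::size_t n2,
                                           std::size_t n3) {
  assert(n1 > 2 * kHalfLength && n2 > 2 * kHalfLength && n3 > 2 * kHalfLength);
  return {n1 - 2 * kHalfLength, n2 - 2 * kHalfLength, n3 - 2 * kHalfLength};
}

// Row-major offset of cell (i, j, k) in the padded buffer. SYCL buffers are
// linearized this way, so the last index (k) is contiguous in memory.
std::size_t LinearIndex(std::size_t i, std::size_t j, std::size_t k,
                        std::size_t n2, std::size_t n3) {
  return (i * n2 + j) * n3 + k;
}
```

Under these assumptions, a padded 272 x 272 x 272 grid yields a 256 x 256 x 256 kernel range, so only interior cells are updated while halo cells are read as neighbours.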
* Create a kernel that does the calculations; each work-item computes one cell.

```
// Submit a command group; the accessors are created here (see the full source below)
q.submit([&](handler& h) {
  // Send a SYCL kernel (lambda) to the device for parallel execution
  // Each work-item computes a single cell
  h.parallel_for(kernel_range, [=](id<3> idx) {
    // Start of device code
    // Add offsets to indices to exclude HALO
    int i = idx[0] + kHalfLength;
    int j = idx[1] + kHalfLength;
    int k = idx[2] + kHalfLength;

    // Calculate values for each cell
    // Please refer to the source code below
  });
});
```
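As a host-side illustration of what a single work-item computes, the following sketch applies the same update to one cell of a flat, row-major array in plain C++. It mirrors the kernel body in the full source below, but `UpdateCell` is a hypothetical helper written for this explanation, and the halo width is taken from the size of the coefficient vector rather than from kHalfLength.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: update one interior cell (i, j, k) of a row-major
// grid whose two innermost extents are n2 and n3. coeff holds
// half_length + 1 entries, like the coeff array in the sample.
void UpdateCell(std::vector<float>& next, const std::vector<float>& prev,
                const std::vector<float>& vel, const std::vector<float>& coeff,
                std::size_t n2, std::size_t n3,
                std::size_t i, std::size_t j, std::size_t k) {
  assert(!coeff.empty());
  const std::size_t half_length = coeff.size() - 1;
  // The cell must be far enough from the border to read its halo.
  assert(i >= half_length && j >= half_length && k >= half_length);

  auto at = [=](std::size_t a, std::size_t b, std::size_t c) {
    return (a * n2 + b) * n3 + c;
  };

  // Centre contribution, then the six neighbours at each distance x.
  float value = prev[at(i, j, k)] * coeff[0];
  for (std::size_t x = 1; x <= half_length; ++x) {
    value += coeff[x] * (prev[at(i, j, k + x)] + prev[at(i, j, k - x)] +
                         prev[at(i, j + x, k)] + prev[at(i, j - x, k)] +
                         prev[at(i + x, j, k)] + prev[at(i - x, j, k)]);
  }

  // Time update: next holds the value from two steps ago on entry.
  const std::size_t c = at(i, j, k);
  next[c] = 2.0f * prev[c] - next[c] + value * vel[c];
}
```

In the SYCL version, the runtime launches one such update per element of kernel_range, so the triple loop over cells disappears from the source.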
The cell below contains the complete Iso3DFD GPU code using SYCL. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile src/2_GPU_basic.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <string>
#include <fstream>
#include <iostream>
#include <utility>

#include "Utils.hpp"

using namespace sycl;

void iso3dfd(queue& q, float* next, float* prev, float* vel, float* coeff,
             const size_t n1, const size_t n2, const size_t n3,
             const size_t nreps) {
  // Create 3D SYCL range for kernels which does not include HALO
  range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,
                        n3 - 2 * kHalfLength);
  // Create 3D SYCL range for buffers which include HALO
  range<3> buffer_range(n1, n2, n3);
  // Create buffers using SYCL class buffer
  buffer next_buf(next, buffer_range);
  buffer prev_buf(prev, buffer_range);
  buffer vel_buf(vel, buffer_range);
  buffer coeff_buf(coeff, range(kHalfLength + 1));

  for (size_t it = 0; it < nreps; it += 1) {
    // Submit command group for execution
    q.submit([&](handler& h) {
      // Create accessors
      accessor next_acc(next_buf, h);
      accessor prev_acc(prev_buf, h);
      accessor vel_acc(vel_buf, h, read_only);
      accessor coeff_acc(coeff_buf, h, read_only);

      // Send a SYCL kernel (lambda) to the device for parallel execution
      // Each work-item computes a single cell
      h.parallel_for(kernel_range, [=](id<3> idx) {
        // Start of device code
        // Add offsets to indices to exclude HALO
        int i = idx[0] + kHalfLength;
        int j = idx[1] + kHalfLength;
        int k = idx[2] + kHalfLength;

        // Calculate values for each cell
        float value = prev_acc[i][j][k] * coeff_acc[0];
#pragma unroll(8)
        for (int x = 1; x <= kHalfLength; x++) {
          value +=
              coeff_acc[x] * (prev_acc[i][j][k + x] + prev_acc[i][j][k - x] +
                              prev_acc[i][j + x][k] + prev_acc[i][j - x][k] +
                              prev_acc[i + x][j][k] + prev_acc[i - x][j][k]);
        }
        next_acc[i][j][k] = 2.0f * prev_acc[i][j][k] - next_acc[i][j][k] +
                            value * vel_acc[i][j][k];
        // End of device code
      });
    });

    // Swap the buffers for always having current values in prev buffer
    std::swap(next_buf, prev_buf);
  }
}

int main(int argc, char* argv[]) {
  // Arrays used to update the wavefield
  float* prev;
  float* next;
  // Array to store wave velocity
  float* vel;

  // Variables to store size of grids and number of simulation iterations
  size_t n1, n2, n3;
  size_t num_iterations;

  // Flag to verify results with CPU version
  bool verify = false;

  if (argc < 5) {
    Usage(argv[0]);
    return 1;
  }

  try {
    // Parse command line arguments and increase them by HALO
    n1 = std::stoi(argv[1]) + (2 * kHalfLength);
    n2 = std::stoi(argv[2]) + (2 * kHalfLength);
    n3 = std::stoi(argv[3]) + (2 * kHalfLength);
    num_iterations = std::stoi(argv[4]);
    if (argc > 5) verify = true;
  } catch (...) {
    Usage(argv[0]);
    return 1;
  }

  // Validate input sizes for the grid
  if (ValidateInput(n1, n2, n3, num_iterations)) {
    Usage(argv[0]);
    return 1;
  }

  // Create queue and print target info with default selector and in order
  // property
  queue q(default_selector_v, {property::queue::in_order()});
  std::cout << " Running GPU basic offload version\n";
  printTargetInfo(q);

  // Compute the total size of grid
  size_t nsize = n1 * n2 * n3;

  prev = new float[nsize];
  next = new float[nsize];
  vel = new float[nsize];

  // Compute coefficients to be used in wavefield update
  float coeff[kHalfLength + 1] = {-3.0548446,   +1.7777778,     -3.1111111e-1,
                                  +7.572087e-2, -1.76767677e-2, +3.480962e-3,
                                  -5.180005e-4, +5.074287e-5,   -2.42812e-6};

  // Apply the DX, DY and DZ to coefficients
  coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);
  for (auto i = 1; i <= kHalfLength; i++) {
    coeff[i] = coeff[i] / (dxyz * dxyz);
  }

  // Initialize arrays and introduce initial conditions (source)
  initialize(prev, next, vel, n1, n2, n3);

  auto start = std::chrono::steady_clock::now();

  // Invoke the driver function to perform 3D wave propagation offloaded to
  // the device
  iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations);

  auto end = std::chrono::steady_clock::now();
  auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                  .count();
  printStats(time, n1, n2, n3, num_iterations);

  // Verify result with the CPU serial version
  if (verify) {
    VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations);
  }

  delete[] prev;
  delete[] next;
  delete[] vel;

  return 0;
}

Once the application is built, we can run it from the command line with a few parameters, as follows:
src/2_GPU_basic 256 256 256 100
<ul>
    <li>src/2_GPU_basic is the binary</li>
    <li>256 256 256 are the sizes of the 3 dimensions; increasing them results in more computation time</li>
    <li>100 is the number of time steps</li>
    <li>any additional argument after the number of time steps enables verification against the CPU serial version</li>
</ul>
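The grid sizes given on the command line are padded by the halo before any memory is allocated, which affects how much host memory a run needs. The sketch below reproduces that arithmetic in plain C++. `PaddedDim` and `GridBytes` are hypothetical helpers, and `kHalfLength = 8` is an assumption inferred from the 9-entry coefficient array.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: halo width of 8 cells (the real constant is in Utils.hpp).
constexpr std::size_t kHalfLength = 8;

// Padded dimension as computed in main(): the user-supplied size plus a
// halo on each side. std::stoi throws on non-numeric input.
std::size_t PaddedDim(const std::string& arg) {
  const int n = std::stoi(arg);
  assert(n > 0);
  return static_cast<std::size_t>(n) + 2 * kHalfLength;
}

// Host memory in bytes for the three float arrays prev, next and vel.
std::size_t GridBytes(std::size_t n1, std::size_t n2, std::size_t n3) {
  return 3 * n1 * n2 * n3 * sizeof(float);
}
```

With these assumptions, a size of 256 becomes a padded dimension of 272, and the three arrays of a 256 256 256 run occupy roughly 240 MB on the host with 4-byte floats.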

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_only.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_only.sh; else ./run_gpu_only.sh; fi

## Iso3DFD GPU Optimizations

We started from a version of the code running in standard C++ on the CPU. Using Intel® Offload Advisor, we determined which loop was a good candidate for offload, and then used SYCL to build a version that runs on the GPU and can also run on the CPU.

Getting the best possible performance on the CPU or on the GPU would require fine-tuning specific to each platform, but we already have a portable solution.

The next step in optimizing further on the GPU is to run the Roofline Model and/or VTune to find out whether there are obvious bottlenecks.

## What is the Roofline Model?

A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks.  Intel Advisor includes an automated Roofline tool that measures and plots the chart on its own, so all you need to do is read it.

The chart can be used to identify not only where bottlenecks exist, but what’s likely causing them, and which ones will provide the most speedup if optimized.
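The idea behind the chart can be sketched in a few lines of plain C++: for a given arithmetic intensity (operations per byte moved), attainable performance is capped either by the compute peak or by memory bandwidth times intensity, whichever is lower. The function names below are hypothetical, and any peak or bandwidth values passed in are placeholders rather than measurements of a real device; Advisor measures the actual roofs for you.

```cpp
#include <algorithm>
#include <cassert>

// Attainable performance (e.g. GFLOP/s or GINTOP/s) under the roofline
// model: the lower of the compute roof and the memory roof.
double AttainableGops(double peak_gops, double bandwidth_gbs,
                      double intensity) {
  assert(peak_gops > 0 && bandwidth_gbs > 0 && intensity >= 0);
  return std::min(peak_gops, bandwidth_gbs * intensity);
}

// Arithmetic intensity at which the two roofs meet (the ridge point).
double RidgeIntensity(double peak_gops, double bandwidth_gbs) {
  assert(bandwidth_gbs > 0);
  return peak_gops / bandwidth_gbs;
}

// A kernel left of the ridge point is limited by memory bandwidth;
// a kernel at or right of it is limited by compute.
bool IsMemoryBound(double peak_gops, double bandwidth_gbs, double intensity) {
  return intensity < RidgeIntensity(peak_gops, bandwidth_gbs);
}
```

For example, with a placeholder peak of 1000 and a placeholder bandwidth of 100 GB/s, the ridge point is at an intensity of 10, so a kernel with intensity 2 is memory bound and one with intensity 20 is compute bound. On a multi-level GPU roofline, the same check is repeated per memory level, each with its own bandwidth.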

#### Requirements for a Roofline Model on a GPU
To generate a roofline analysis report, the application must run at least partially on a GPU, the offload must be implemented with OpenMP, SYCL or OpenCL, and a recent version of Intel® Advisor is required.

Generating a Roofline Model on a GPU produces a multi-level roofline, where a single loop generates several dots and each dot can be compared to the roof of its own memory level (GTI/L3/DRAM/SLM).


#### Finding Effective Optimization Strategies
The GPU Roofline Performance Insights highlight poorly performing loops and show the performance ‘headroom’ for each loop, indicating which loops can be improved and which are worth improving. The report shows the likely cause of each bottleneck, such as being memory bound vs. compute bound, and suggests next optimization steps.

  
  <img src="img/r1.png">
 


### Running the GPU Roofline Analysis
With the offload implemented in 2_GPU_basic using SYCL, we'll run the roofline analysis to look for areas where there is room for performance optimization. The general form of the command is:
```
advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myApplication 
```
The iso3DFD GPU code can be profiled using
```
advisor --collect=roofline --profile-gpu --project-dir=./../advisor/2_gpu -- ./build/src/2_GPU_basic 256 256 256 100
```

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_roofline_advisor.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_roofline_advisor.sh; else ./run_gpu_roofline_advisor.sh; fi

### Analyzing the HTML report

From the roofline analysis of the 2_GPU_basic.cpp version, we can see that the performance is close to predicted.
In the roofline model below, we can observe that:

* The application is compute bound; specifically, the kernels have high arithmetic intensity.
* GINTOPS is more than 15x the GFLOPS.
* XVE Threading Occupancy is high.
* We are clearly bound by INT operations, which come from index computations.

<img src="img/gpu_basic.png">

### Roofline Analysis report
To display the report, execute the following cell. In practice, the report will be available in the folder you defined as --out-dir in the previous script.

[View the report in HTML](reports/advisor-report.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/advisor-report.html', width=1024, height=768))


## Generating VTune reports
In the exercises below, we use the VTune™ analyzer to see what is going on with each implementation. The information shown is the high-level hotspot summary generated from the collection and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered as HTML pages. The VTune scripts below collect GPU offload and GPU hotspots information.

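#### Learn more about VTune

Extensive training on VTune is available; see the [VTune Profiler page](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) for deep-dive training.

The commands below collect the GPU offload and GPU hotspots data and then generate the HTML summary reports: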

```
vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir -- ./build/src/2_GPU_basic 1024 1024 1024 100
```

```
vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots -- ./build/src/2_GPU_basic 1024 1024 1024 100
```

```
vtune -report summary -result-dir vtune_dir -format html -report-output ./reports/output_offload.html
```

```
vtune -report summary -result-dir vtune_dir_hotspots -format html -report-output ./reports/output_hotspots.html
```
[View the VTune offload report in HTML](reports/output_offload.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/output_offload.html', width=1024, height=768))


[View the VTune hotspots report in HTML](reports/output_hotspots.html)

In [None]:

from IPython.display import IFrame
display(IFrame(src='reports/output_hotspots.html', width=1024, height=768))


### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_vtune.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_vtune.sh; else ./run_gpu_vtune.sh; fi

## Summary
### Next Iteration of Implementing GPU Optimizations

We ran the roofline model and observed:
* The application is now compute bound: the kernels have high arithmetic intensity, and we are bound by INT operations, which come from index computations.
* What we need to do is give the kernel the right index (the offset in the original code).
* SYCL provides this information through an item object that the runtime passes to the kernel function. This object identifies the position of the current work-item in the 3D space.
* It can be queried in each of the 3 dimensions by calling it.get_global_id(0), it.get_global_id(1) and it.get_global_id(2).
* In the next iteration, we'll address the compute-bound kernels by changing how indices are calculated, so that fewer index calculations are needed.
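
One way to picture the kind of change described above is the plain C++ sketch below. It compares recomputing a full 3D index for every neighbour with computing a single base offset once and then stepping by fixed strides. This is only an illustration of the idea under the assumption of a row-major layout; the function names are hypothetical, and the next version of the sample may organise its indices differently. In a kernel, the three coordinates would come from it.get_global_id(0), it.get_global_id(1) and it.get_global_id(2).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of the six neighbours at distance x, recomputing the full 3D index
// (two multiplications and two additions) for every access.
float NeighbourSum3D(const std::vector<float>& prev, std::size_t n2,
                     std::size_t n3, std::size_t i, std::size_t j,
                     std::size_t k, std::size_t x) {
  auto at = [=](std::size_t a, std::size_t b, std::size_t c) {
    return (a * n2 + b) * n3 + c;
  };
  return prev[at(i, j, k + x)] + prev[at(i, j, k - x)] +
         prev[at(i, j + x, k)] + prev[at(i, j - x, k)] +
         prev[at(i + x, j, k)] + prev[at(i - x, j, k)];
}

// Same sum, but the base offset is computed once and neighbours are
// reached by adding or subtracting precomputed strides.
float NeighbourSumStrided(const std::vector<float>& prev, std::size_t n2,
                          std::size_t n3, std::size_t i, std::size_t j,
                          std::size_t k, std::size_t x) {
  const std::size_t stride_j = n3;       // step to the next j
  const std::size_t stride_i = n2 * n3;  // step to the next i
  const std::size_t base = i * stride_i + j * stride_j + k;
  assert(base >= x * stride_i && base + x * stride_i < prev.size());
  return prev[base + x] + prev[base - x] +
         prev[base + x * stride_j] + prev[base - x * stride_j] +
         prev[base + x * stride_i] + prev[base - x * stride_i];
}
```

Both functions read the same six cells in the same order, so they return identical results; the second simply spends fewer integer operations per access.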
