# ISO3DFD on a GPU and Index computations

## Learning Objectives

<ul>
    <li>Understand how to address the application being compute bound by reducing index calculations</li>    
    <li>Run roofline analysis and the VTune reports again to gauge the results and look for additional opportunities</li>
    
</ul>

## Iso3DFD reducing the index calculations

In the previous activity, we used Intel® Advisor roofline analysis to determine that the application is compute bound: the kernels have high arithmetic intensity and are bounded by INT operations, which are all index computations.
To fix this, the kernel needs a cheaper way to obtain the right index (the offset in the original code). SYCL provides this information through an index object that the runtime passes to the kernel function. This object identifies the position of the current work-item in the 3D iteration space. Its three components can be read by calling it.get_global_id(0), it.get_global_id(1), and it.get_global_id(2), or with idx[0], idx[1], and idx[2] when the kernel receives an id<3>, as in the code in this notebook.

In this notebook, we'll address the kernels being compute bound by changing how indices are calculated, so that far fewer integer operations are needed per cell.
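The flattening idea can be tried on the host with plain C++ before looking at the kernel. The sketch below is illustrative only: `LinearIndex` and `Unflatten` are hypothetical helper names that do not appear in the sample, and they assume the same row-major layout the kernel uses (`idx = i * n2n3 + j * n3 + k`, with `k` as the fastest-moving dimension).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: row-major flattening of a 3D position,
// matching idx = i * n2n3 + j * n3 + k in the kernel.
inline std::size_t LinearIndex(std::size_t i, std::size_t j, std::size_t k,
                               std::size_t n2, std::size_t n3) {
  assert(j < n2 && k < n3);  // j and k must lie inside the padded grid
  return i * (n2 * n3) + j * n3 + k;
}

// Hypothetical inverse, useful only for checking the mapping.
inline std::array<std::size_t, 3> Unflatten(std::size_t idx, std::size_t n2,
                                            std::size_t n3) {
  return {idx / (n2 * n3), (idx / n3) % n2, idx % n3};
}
```

Moving one cell along `k`, `j`, or `i` changes the flat index by `1`, `n3`, or `n2 * n3` respectively. That fixed stride per dimension is what lets the kernel reach a neighbor with a single add or subtract on the base index.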

## Optimizing the Indexing of the Iso3DFD application
The 3_GPU_linear version of the sample implements the index calculation optimization by changing the 3D indexing to 1D. To do this, we need to flatten the buffers, change how we calculate the memory location for each kernel, and change how we access the neighbors.
* For the index calculation optimization, we change the 3D indexing to 1D and flatten the buffers

```
// Create 1D SYCL range for buffers which include HALO
range<1> buffer_range(n1 * n2 * n3);
// Create buffers using SYCL class buffer
buffer next_buf(next, buffer_range);
buffer prev_buf(prev, buffer_range);
buffer vel_buf(vel, buffer_range);
buffer coeff_buf(coeff, range(kHalfLength + 1));
```

* We change how we calculate location in the memory for each kernel

```
// Start of device code
// Add offsets to indices to exclude HALO
int n2n3 = n2 * n3;
int i = nidx[0] + kHalfLength;
int j = nidx[1] + kHalfLength;
int k = nidx[2] + kHalfLength;

// Calculate linear index for each cell
int idx = i * n2n3 + j * n3 + k;

```
* We change how we are accessing the neighbors

```
// Calculate values for each cell
    float value = prev_acc[idx] * coeff_acc[0];
#pragma unroll(8)
    for (int x = 1; x <= kHalfLength; x++) {
      value +=
          coeff_acc[x] * (prev_acc[idx + x]        + prev_acc[idx - x] +
                          prev_acc[idx + x * n3]   + prev_acc[idx - x * n3] +
                          prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]);
    }
    next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] +
                    value * vel_acc[idx];
// End of device code
});
});

```
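To see that the flat offsets address the same neighbors as full 3D indexing, the two access patterns can be compared on the host. This is a minimal sketch under stated assumptions: `NeighbourSum3D` and `NeighbourSumLinear` are hypothetical names, the array is a plain `std::vector<float>` rather than a SYCL accessor, and the caller is assumed to pass a cell that is at least `radius` cells away from every face of the grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the 6 * radius axis neighbours, recomputing the full 3D -> 1D mapping
// for every access (the pattern the optimization removes).
inline float NeighbourSum3D(const std::vector<float>& a, std::size_t n2,
                            std::size_t n3, std::size_t i, std::size_t j,
                            std::size_t k, std::size_t radius) {
  auto at = [&](std::size_t ii, std::size_t jj, std::size_t kk) {
    return a[ii * n2 * n3 + jj * n3 + kk];
  };
  float sum = 0.0f;
  for (std::size_t x = 1; x <= radius; ++x) {
    sum += at(i, j, k + x) + at(i, j, k - x) +
           at(i, j + x, k) + at(i, j - x, k) +
           at(i + x, j, k) + at(i - x, j, k);
  }
  return sum;
}

// Same sum, but the base index is computed once and every neighbour is
// reached by adding or subtracting a multiple of a fixed stride.
inline float NeighbourSumLinear(const std::vector<float>& a, std::size_t n2,
                                std::size_t n3, std::size_t i, std::size_t j,
                                std::size_t k, std::size_t radius) {
  const std::size_t n2n3 = n2 * n3;
  const std::size_t idx = i * n2n3 + j * n3 + k;
  assert(idx < a.size());  // the centre cell must be inside the array
  float sum = 0.0f;
  for (std::size_t x = 1; x <= radius; ++x) {
    sum += a[idx + x]        + a[idx - x] +
           a[idx + x * n3]   + a[idx - x * n3] +
           a[idx + x * n2n3] + a[idx - x * n2n3];
  }
  return sum;
}
```

Both functions visit the same elements in the same order, so they return the same value for any interior cell. The linear variant simply does less integer arithmetic per access, which is the effect the roofline analysis is expected to show.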
We will run roofline analysis and the VTune reports again to gauge the results and look for additional opportunities for optimization based on 3_GPU_linear.

The SYCL code below shows the Iso3DFD GPU code with the index optimization, using USM. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile src/3_GPU_linear_USM.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <chrono>
#include <string>
#include <fstream>

#include "Utils.hpp"

using namespace sycl;

bool iso3dfd(sycl::queue &q, float *ptr_next, float *ptr_prev,
                   float *ptr_vel, float *ptr_coeff, size_t n1, size_t n2,
                   size_t n3, unsigned int nIterations) {
  auto nx = n1;
  auto nxy = n1*n2;
  auto grid_size = n1*n2*n3;
  auto b1 = kHalfLength;
  auto b2 = kHalfLength;
  auto b3 = kHalfLength;
  
  // Create 3D SYCL range for kernels that excludes the HALO
  range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,
                        n3 - 2 * kHalfLength);

  auto next = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  next += (16 - b1);
  q.memcpy(next, ptr_next, sizeof(float)*grid_size);
  auto prev = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  prev += (16 - b1);
  q.memcpy(prev, ptr_prev, sizeof(float)*grid_size);
  auto vel = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  vel += (16 - b1);
  q.memcpy(vel, ptr_vel, sizeof(float)*grid_size);
  auto coeff = sycl::aligned_alloc_device<float>(64, kHalfLength + 1, q);
  //coeff += (16 - b1);
  q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1));
  q.wait();

  for (auto it = 0; it < nIterations; it += 1) {
    // Submit command group for execution
    q.submit([&](handler& h) {
      // Send a SYCL kernel(lambda) to the device for parallel execution
      // Each kernel runs single cell
      h.parallel_for(kernel_range, [=](id<3> idx) {
        // Start of device code
        // Add offsets to indices to exclude HALO
        int n2n3 = n2 * n3;
        int i = idx[0] + kHalfLength;
        int j = idx[1] + kHalfLength;
        int k = idx[2] + kHalfLength;

        // Calculate linear index for each cell
        int gid = i * n2n3 + j * n3 + k;
        auto value = coeff[0] * prev[gid];
          
        // Calculate values for each cell
#pragma unroll(8)
        for (int x = 1; x <= kHalfLength; x++) {
          value += coeff[x] * (prev[gid + x] + prev[gid - x] +
                               prev[gid + x * n3]   + prev[gid - x * n3] +
                               prev[gid + x * n2n3] + prev[gid - x * n2n3]);
        }
        next[gid] = 2.0f * prev[gid] - next[gid] + value * vel[gid];
          
        // End of device code
      });
    }).wait();

    // Swap the buffers for always having current values in prev buffer
    std::swap(next, prev);
  }
  // Wait for the copy back to the host to finish before freeing device memory
  q.memcpy(ptr_prev, prev, sizeof(float)*grid_size).wait();

  sycl::free(next - (16 - b1),q);
  sycl::free(prev - (16 - b1),q);
  sycl::free(vel - (16 - b1),q);
  sycl::free(coeff,q);
  return true;
}

int main(int argc, char* argv[]) {
  // Arrays used to update the wavefield
  float* prev;
  float* next;
  // Array to store wave velocity
  float* vel;

  // Variables to store size of grids and number of simulation iterations
  size_t n1, n2, n3;
  size_t num_iterations;

  // Flag to verify results with CPU version
  bool verify = false;

  if (argc < 5) {
    Usage(argv[0]);
    return 1;
  }

  try {
    // Parse command line arguments and increase them by HALO
    n1 = std::stoi(argv[1]) + (2 * kHalfLength);
    n2 = std::stoi(argv[2]) + (2 * kHalfLength);
    n3 = std::stoi(argv[3]) + (2 * kHalfLength);
    num_iterations = std::stoi(argv[4]);
    if (argc > 5) verify = true;
  } catch (...) {
    Usage(argv[0]);
    return 1;
  }

  // Validate input sizes for the grid
  if (ValidateInput(n1, n2, n3, num_iterations)) {
    Usage(argv[0]);
    return 1;
  }

  // Create queue and print target info with default selector and in order
  // property
  queue q(default_selector_v, {property::queue::in_order()});
  std::cout << " Running linear indexed GPU version\n";
  printTargetInfo(q);

  // Compute the total size of grid
  size_t nsize = n1 * n2 * n3;

  prev = new float[nsize];
  next = new float[nsize];
  vel = new float[nsize];

  // Compute coefficients to be used in wavefield update
  float coeff[kHalfLength + 1] = {-3.0548446,   +1.7777778,     -3.1111111e-1,
                                  +7.572087e-2, -1.76767677e-2, +3.480962e-3,
                                  -5.180005e-4, +5.074287e-5,   -2.42812e-6};

  // Apply the DX, DY and DZ to coefficients
  coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);
  for (auto i = 1; i <= kHalfLength; i++) {
    coeff[i] = coeff[i] / (dxyz * dxyz);
  }

  // Initialize arrays and introduce initial conditions (source)
  initialize(prev, next, vel, n1, n2, n3);

  auto start = std::chrono::steady_clock::now();

  // Invoke the driver function to perform 3D wave propagation offloaded to
  // the device
  iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations);

  auto end = std::chrono::steady_clock::now();
  auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                  .count();
  printStats(time, n1, n2, n3, num_iterations);

  // Verify result with the CPU serial version
  if (verify) {
    VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations);
  }

  delete[] prev;
  delete[] next;
  delete[] vel;

  return 0;
}

Once the application is built, we can run it from the command line with a few parameters, as follows:
src/3_GPU_linear 1024 1024 1024 100
<ul>
    <li>src/3_GPU_linear is the binary</li>
    <li>1024 1024 1024 are the sizes of the 3 dimensions; increasing them results in more computation time</li>
    <li>100 is the number of time steps</li>
</ul>
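The grid sizes on the command line are interior sizes; `main` adds `2 * kHalfLength` to each dimension for the halo before allocating. The sketch below estimates the resulting array size. It is an illustration, not part of the sample: `PaddedExtent` and `GridBytes` are hypothetical helpers, and the value 8 for the half length is an assumption inferred from the nine stencil coefficients in the code.

```cpp
#include <cstddef>

// Assumption: kHalfLength is 8 (the coefficient array has kHalfLength + 1 = 9
// entries). The real constant lives in Utils.hpp.
constexpr std::size_t kHalfLengthAssumed = 8;

// Padded extent of one dimension: interior cells plus a halo on both sides.
constexpr std::size_t PaddedExtent(std::size_t interior) {
  return interior + 2 * kHalfLengthAssumed;
}

// Bytes needed for one float array that covers the whole padded grid.
constexpr std::size_t GridBytes(std::size_t a, std::size_t b, std::size_t c) {
  return PaddedExtent(a) * PaddedExtent(b) * PaddedExtent(c) * sizeof(float);
}

// 1024 interior cells become 1040 cells per dimension once the halo is added.
static_assert(PaddedExtent(1024) == 1040, "halo adds 16 cells per dimension");
```

Under these assumptions, and with 4-byte floats on a 64-bit platform, `GridBytes(1024, 1024, 1024)` is 4,499,456,000 bytes, so each of the three arrays (`prev`, `next`, `vel`) takes about 4.5 GB. Smaller sizes such as 256 256 256 are a reasonable choice on devices with less memory.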

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_linear_usm.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_linear_usm.sh; else ./run_gpu_linear_usm.sh; fi

## ISO3DFD Linear using Buffers and Accessors

The code below implements the same linear indexing optimization with SYCL buffers and accessors instead of USM.
Inspect the code cell below and click run ▶ to save the code to file:

In [None]:
%%writefile src/3_GPU_linear.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <chrono>
#include <string>
#include <fstream>

#include "Utils.hpp"

using namespace sycl;

void iso3dfd(sycl::queue &q, float *next, float *prev, float *vel,
             float *coeff, size_t n1, size_t n2, size_t n3,
             size_t nreps) {
  // Create 3D SYCL range for kernels that excludes the HALO
  range<3> kernel_range(n1 - 2 * kHalfLength, n2 - 2 * kHalfLength,
                        n3 - 2 * kHalfLength);
  // Create 1D SYCL range for buffers which include HALO
  range<1> buffer_range(n1 * n2 * n3);
  // Create buffers using SYCL class buffer
  buffer next_buf(next, buffer_range);
  buffer prev_buf(prev, buffer_range);
  buffer vel_buf(vel, buffer_range);
  buffer coeff_buf(coeff, range(kHalfLength + 1));

  for (auto it = 0; it < nreps; it++) {
    // Submit command group for execution
    q.submit([&](handler& h) {
      // Create accessors
      accessor next_acc(next_buf, h);
      accessor prev_acc(prev_buf, h);
      accessor vel_acc(vel_buf, h, read_only);
      accessor coeff_acc(coeff_buf, h, read_only);

      // Send a SYCL kernel(lambda) to the device for parallel execution
      // Each kernel runs single cell
      h.parallel_for(kernel_range, [=](id<3> nidx) {
        // Start of device code
        // Add offsets to indices to exclude HALO
        int n2n3 = n2 * n3;
        int i = nidx[0] + kHalfLength;
        int j = nidx[1] + kHalfLength;
        int k = nidx[2] + kHalfLength;

        // Calculate linear index for each cell
        int idx = i * n2n3 + j * n3 + k;

        // Calculate values for each cell
        float value = prev_acc[idx] * coeff_acc[0];
#pragma unroll(8)
        for (int x = 1; x <= kHalfLength; x++) {
          value +=
              coeff_acc[x] * (prev_acc[idx + x]        + prev_acc[idx - x] +
                              prev_acc[idx + x * n3]   + prev_acc[idx - x * n3] +
                              prev_acc[idx + x * n2n3] + prev_acc[idx - x * n2n3]);
        }
        next_acc[idx] = 2.0f * prev_acc[idx] - next_acc[idx] +
                            value * vel_acc[idx];
        // End of device code
      });
    });

    // Swap the buffers for always having current values in prev buffer
    std::swap(next_buf, prev_buf);
  }
}

int main(int argc, char* argv[]) {
  // Arrays used to update the wavefield
  float* prev;
  float* next;
  // Array to store wave velocity
  float* vel;

  // Variables to store size of grids and number of simulation iterations
  size_t n1, n2, n3;
  size_t num_iterations;

  // Flag to verify results with CPU version
  bool verify = false;

  if (argc < 5) {
    Usage(argv[0]);
    return 1;
  }

  try {
    // Parse command line arguments and increase them by HALO
    n1 = std::stoi(argv[1]) + (2 * kHalfLength);
    n2 = std::stoi(argv[2]) + (2 * kHalfLength);
    n3 = std::stoi(argv[3]) + (2 * kHalfLength);
    num_iterations = std::stoi(argv[4]);
    if (argc > 5) verify = true;
  } catch (...) {
    Usage(argv[0]);
    return 1;
  }

  // Validate input sizes for the grid
  if (ValidateInput(n1, n2, n3, num_iterations)) {
    Usage(argv[0]);
    return 1;
  }

  // Create queue and print target info with default selector and in order
  // property
  queue q(default_selector_v, {property::queue::in_order()});
  std::cout << " Running linear indexed GPU version\n";
  printTargetInfo(q);

  // Compute the total size of grid
  size_t nsize = n1 * n2 * n3;

  prev = new float[nsize];
  next = new float[nsize];
  vel = new float[nsize];

  // Compute coefficients to be used in wavefield update
  float coeff[kHalfLength + 1] = {-3.0548446,   +1.7777778,     -3.1111111e-1,
                                  +7.572087e-2, -1.76767677e-2, +3.480962e-3,
                                  -5.180005e-4, +5.074287e-5,   -2.42812e-6};

  // Apply the DX, DY and DZ to coefficients
  coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);
  for (auto i = 1; i <= kHalfLength; i++) {
    coeff[i] = coeff[i] / (dxyz * dxyz);
  }

  // Initialize arrays and introduce initial conditions (source)
  initialize(prev, next, vel, n1, n2, n3);

  auto start = std::chrono::steady_clock::now();

  // Invoke the driver function to perform 3D wave propagation offloaded to
  // the device
  iso3dfd(q, next, prev, vel, coeff, n1, n2, n3, num_iterations);

  auto end = std::chrono::steady_clock::now();
  auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                  .count();
  printStats(time, n1, n2, n3, num_iterations);

  // Verify result with the CPU serial version
  if (verify) {
    VerifyResult(prev, next, vel, coeff, n1, n2, n3, num_iterations);
  }

  delete[] prev;
  delete[] next;
  delete[] vel;

  return 0;
}

Once the application is built, we can run it from the command line with a few parameters, as follows:
src/3_GPU_linear 1024 1024 1024 100
<ul>
    <li>src/3_GPU_linear is the binary</li>
    <li>1024 1024 1024 are the sizes of the 3 dimensions; increasing them results in more computation time</li>
    <li>100 is the number of time steps</li>
</ul>

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_linear.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_linear.sh; else ./run_gpu_linear.sh; fi

## ISO3DFD GPU Optimizations

* We started from a code version running with standard C++ on the CPU.
* Using Intel® Offload Advisor, we determined which loop was a good candidate for offload, and then used SYCL to make the code run on the GPU as well as on the CPU.
* We identified that the application is bound by integer operations.
* Finally, we fixed the indexing in the current module to reduce the integer work.
* The next step is to run the Roofline Model and VTune to
    * Check whether the current optimizations fixed the application being compute and INT bound
    * Look for opportunities to optimize further on the GPU and understand whether we still have obvious bottlenecks.

### Running the GPU Roofline Analysis
With the offload implemented in 3_GPU_linear using SYCL, we'll want to run roofline analysis to see the improvements we made to the application and look for more areas where there is room for performance optimization.
```
advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myApplication 
```
The iso3DFD GPU Linear code can be run using
```
advisor --collect=roofline --profile-gpu --project-dir=./../advisor/3_gpu -- ./build/src/3_GPU_linear 1024 1024 1024 100
```

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_roofline_advisor_usm.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_roofline_advisor_usm.sh; else ./run_gpu_roofline_advisor_usm.sh; fi

### Analyzing the output

From the roofline analysis of the 3_GPU_linear.cpp version, we can see that the performance is close to the prediction.
In the roofline model below, we can observe the following:

* The improvements we see are:
    * GINTOPS is 3X lower than in the previous version of the GPU code without the linear indexing optimization. Similarly, we get more GFLOPS
    * Less data transfer time
    * Higher bandwidth usage
* The bottlenecks we see are:
    * The application is now bounded by memory, specifically by the L3 bandwidth.



<img src="img/gpu_linear.png">
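A rough count of floating-point work against data movement helps explain why the kernel ends up memory bound once the integer overhead is gone. The sketch below is a back-of-the-envelope estimate read off the kernel body, not a figure reported by Intel® Advisor. The function names are hypothetical, the half length of 8 is an assumption inferred from the nine coefficients, adds and multiplies are counted separately, and coefficient loads are ignored.

```cpp
#include <cstddef>

constexpr int kRadius = 8;  // assumption: kHalfLength == 8

// Floating-point operations per updated cell:
//   1 multiply for the centre term,
//   per x: 5 adds inside the bracket, 1 multiply by coeff[x], 1 accumulate,
//   final update: 2 multiplies, 1 subtract, 1 add.
constexpr int FlopsPerCell() { return 1 + 7 * kRadius + 4; }

// Floats moved per cell if nothing is reused from cache:
// 6 * radius + 1 reads of prev, 1 read of next, 1 read of vel, 1 write of next.
constexpr int FloatsNoReuse() { return (6 * kRadius + 1) + 3; }

// Floats moved per cell if every value is fetched from memory only once:
// prev, next and vel reads plus the next write.
constexpr int FloatsPerfectReuse() { return 4; }

// Arithmetic intensity in FLOP per byte.
constexpr double Intensity(int flops, int floats) {
  return static_cast<double>(flops) / (floats * sizeof(float));
}
```

With these counts there are 61 FLOP per cell, and with 4-byte floats the intensity lies between roughly 0.29 FLOP/byte with no reuse and about 3.8 FLOP/byte with perfect reuse. Where the kernel lands in that range depends on how many neighbor loads are served from cache, which is why cache reuse is the natural next thing to look at.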

### Roofline Analysis report overview
To display the report, execute the following cell. In practice, the report will be available in the folder you defined as --out-dir in the previous script.

[View the report in HTML](reports/advisor_report_linear.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/advisor_report_linear.html', width=1024, height=768))


## Generating VTune reports
In the exercises below, we use the VTune™ analyzer to see what is going on with each implementation. The information shown is the high-level hotspot summary generated from the collection and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered via HTML pages. The VTune scripts below collect GPU offload and GPU hotspots information.

#### Learn more about VTune

Extensive training on VTune is available; click [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) for deep-dive training.

```
vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir -- ./build/src/3_GPU_linear 1024 1024 1024 100
```

```
vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots -- ./build/src/3_GPU_linear 1024 1024 1024 100
```

```
vtune -report summary -result-dir vtune_dir -format html -report-output ./reports/output_offload_linear.html
```

```
vtune -report summary -result-dir vtune_dir_hotspots -format html -report-output ./reports/output_hotspots_linear.html
```

[View the report in HTML](reports/output_offload_linear.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/output_offload_linear.html', width=1024, height=768))


[View the report in HTML](reports/output_hotspots_linear.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/output_hotspots_linear.html', width=1024, height=768))


### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_linear_vtune.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_linear_vtune.sh; else ./run_gpu_linear_vtune.sh; fi

## Summary
### Next Iteration of implementing GPU Optimizations
We ran the roofline model and observed:

* With the code changes in the 3_GPU_linear.cpp file, the roofline model shows that the INT operations decreased significantly
* The kernel now has much lower arithmetic intensity and increased bandwidth
* However, the application is now bounded by memory, i.e. L3 bandwidth
* In the next iteration, we'll address the kernels being memory bound by increasing the L1 cache reuse.