# ISO3DFD using nd_range kernel

## Learning Objectives

<ul>
    <li>Understand how to further optimize the application using L1 cache reuse</li>
    <li>Run roofline analysis and the VTune reports again to gauge the results</li>
    
</ul>

## Iso3DFD using nd_range kernel

In the previous activity, we used Intel® Advisor roofline analysis to determine that the application is memory bound. Specifically, the kernels show little cache reuse and are limited by L3 memory bandwidth, so the remedy is to reuse data that is already in cache.

In this notebook, we address the L3 memory bound in the kernels by improving data reuse in the L1 cache.

The tuning puts more work in each local work group, which optimizes loading neighboring stencil points from the fast L1 cache.

To do this, we change the kernel to use nd_range. Each work-item no longer calculates a single cell; instead it iterates over several cells, so that 1024 x 1 x 1 grid points are scheduled on each SIMD16 core and all 1024 points share an L1 cache. In the previous activity, we scheduled 16 x 1 x 1 grid points on each SIMD16 core, so only 16 points shared an L1 cache.

We can change the parameters passed to the application to find the best load for each work group. 


## Optimizing using nd_range kernel
The 4_GPU_optimized version of the sample addresses the memory constraint by reusing data from the L1 cache: it schedules 1024 x 1 x 1 grid points on each SIMD16 core, and all 1024 points share an L1 cache.



```

// Create USM objects 
  auto next = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  next += (16 - kHalfLength);
  q.memcpy(next, ptr_next, sizeof(float)*grid_size);
  auto prev = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  prev += (16 - kHalfLength);
  q.memcpy(prev, ptr_prev, sizeof(float)*grid_size);
  auto vel = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  vel += (16 - kHalfLength);
  q.memcpy(vel, ptr_vel, sizeof(float)*grid_size);
  //auto coeff = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  auto coeff = sycl::aligned_alloc_device<float>(64, kHalfLength+1 , q);
  q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1));  
  q.wait();  
```

* The following integer expression, `(N + M - 1) / M * M`, rounds N up to the next multiple of M. The global nd_range must be an integer multiple of the local nd_range, so the global nd_range is rounded up to the next multiple of the local nd_range. A conditional statement in the kernel ensures that any extra work-items do no work.

```
// Create 1D SYCL range for buffers which include HALO
range<1> buffer_range(n1 * n2 * n3);
auto global_nd_range = range<3>((n3-2*kHalfLength+n3_block-1)/n3_block*n3_block,(n2-2*kHalfLength+n2_block-1)/n2_block*n2_block,n1_block);

```
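The rounding used for `global_nd_range` can be isolated into a small helper to make the arithmetic easier to check. The sketch below is only illustrative: `RoundUp` and `GlobalExtent` are hypothetical names that do not appear in the sample, and the halo width is passed in as a parameter rather than read from `kHalfLength`.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the sample): round n up to the next
// multiple of m. Integer division truncates, so adding m - 1 first pushes
// any partial block up to a whole one. m must be non-zero.
constexpr std::size_t RoundUp(std::size_t n, std::size_t m) {
  return (n + m - 1) / m * m;
}

// Hypothetical helper: extent of one global nd_range dimension.
// n_padded is the grid size including the halo on both sides, so the
// interior that needs work-items is n_padded - 2 * half_length.
constexpr std::size_t GlobalExtent(std::size_t n_padded,
                                   std::size_t half_length,
                                   std::size_t block) {
  return RoundUp(n_padded - 2 * half_length, block);
}
```

For example, assuming a halo width of 8, `GlobalExtent(1016, 8, 32)` covers an interior of 1000 points with 1024 work-items; the 24 extra work-items are the ones that the conditional statement in the kernel keeps idle.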
* Change parallel_for to use nd_range. Each work-item now does more work, reading from the faster L1 cache.

```
q.submit([&](auto &h) {      
        h.parallel_for(
              nd_range(global_nd_range, local_nd_range), [=](auto item)          
         {
            const int iz = kHalfLength + item.get_global_id(0);
            const int iy = kHalfLength + item.get_global_id(1);
            if (iz < n3 - kHalfLength && iy < n2 - kHalfLength)
             for (int ix = kHalfLength+item.get_global_id(2); ix < n1 - kHalfLength; ix += n1_block)
                {
                  auto gid = ix + iy*nx + iz*nxy;
                  float *pgid = prev+gid;
                  auto value = coeff[0] * pgid[0];
#pragma unroll(kHalfLength)
                  for (auto iter = 1; iter <= kHalfLength; iter++)
                    value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]);
                  next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid];
                }
      });    
    }).wait();
   std::swap(next, prev);

```
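The `ix` loop in the kernel gives each work-item a strided walk along x: the work-item starts at `kHalfLength` plus its id in dimension 2 and advances by `n1_block`. A host-only sketch can confirm that this pattern visits every interior point exactly once and never touches the halo. The function names below are hypothetical, the `lane` variable stands in for `item.get_global_id(2)`, and the code assumes `n1 >= 2 * half_length` and a non-zero `n1_block`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of the kernel's ix loop for one row.
// Returns, for each x index, how many work-items visited it.
std::vector<int> CountRowVisits(std::size_t n1, std::size_t half_length,
                                std::size_t n1_block) {
  std::vector<int> visits(n1, 0);
  // One "lane" per work-item along x, as in the third nd_range dimension.
  for (std::size_t lane = 0; lane < n1_block; ++lane) {
    for (std::size_t ix = half_length + lane; ix < n1 - half_length;
         ix += n1_block) {
      ++visits[ix];
    }
  }
  return visits;
}

// True if every interior index is visited exactly once and every halo
// index is never visited.
bool CoversInteriorExactlyOnce(std::size_t n1, std::size_t half_length,
                               std::size_t n1_block) {
  const std::vector<int> visits = CountRowVisits(n1, half_length, n1_block);
  for (std::size_t ix = 0; ix < n1; ++ix) {
    const bool interior = ix >= half_length && ix < n1 - half_length;
    if (visits[ix] != (interior ? 1 : 0)) return false;
  }
  return true;
}
```

Because the loop bound is checked on every step, the interior length does not have to be a multiple of `n1_block`; some work-items simply perform one iteration fewer than others.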
We will run roofline analysis and the VTune reports again to gauge the results.

The code below shows the Iso3DFD GPU code using SYCL with index optimizations and an nd_range kernel. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile src/4_GPU_optimized.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <string>
#include <fstream>

#include "Utils.hpp"

using namespace sycl;

void iso3dfd(queue& q, float* ptr_next, float* ptr_prev, float* ptr_vel, float* ptr_coeff,
             const size_t n1, const size_t n2, const size_t n3,size_t n1_block, size_t n2_block, size_t n3_block,
             const size_t nIterations) {
  auto nx = n1;
  auto nxy = n1*n2;
  auto grid_size = nxy*n3;  

  auto next = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  next += (16 - kHalfLength);
  q.memcpy(next, ptr_next, sizeof(float)*grid_size);
  auto prev = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  prev += (16 - kHalfLength);
  q.memcpy(prev, ptr_prev, sizeof(float)*grid_size);
  auto vel = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  vel += (16 - kHalfLength);
  q.memcpy(vel, ptr_vel, sizeof(float)*grid_size);
  //auto coeff = sycl::aligned_alloc_device<float>(64, grid_size + 16, q);
  auto coeff = sycl::aligned_alloc_device<float>(64, kHalfLength+1 , q);
  q.memcpy(coeff, ptr_coeff, sizeof(float)*(kHalfLength+1));  
  q.wait();  
				  
  auto local_nd_range = range<3>(n3_block,n2_block,n1_block);
  auto global_nd_range = range<3>((n3-2*kHalfLength+n3_block-1)/n3_block*n3_block,(n2-2*kHalfLength+n2_block-1)/n2_block*n2_block,n1_block);
  

  for (auto i = 0; i < nIterations; i += 1) {
    q.submit([&](auto &h) {      
        h.parallel_for(
              nd_range(global_nd_range, local_nd_range), [=](auto item)          
         {
            const int iz = kHalfLength + item.get_global_id(0);
            const int iy = kHalfLength + item.get_global_id(1);
            if (iz < n3 - kHalfLength && iy < n2 - kHalfLength)
             for (int ix = kHalfLength+item.get_global_id(2); ix < n1 - kHalfLength; ix += n1_block)
                {
                  auto gid = ix + iy*nx + iz*nxy;
                  float *pgid = prev+gid;
                  auto value = coeff[0] * pgid[0];
#pragma unroll(kHalfLength)
                  for (auto iter = 1; iter <= kHalfLength; iter++)
                    value += coeff[iter]*(pgid[iter*nxy] + pgid[-iter*nxy] + pgid[iter*nx] + pgid[-iter*nx] + pgid[iter] + pgid[-iter]);
                  next[gid] = 2.0f*pgid[0] - next[gid] + value*vel[gid];
                }
      });    
    }).wait();
   std::swap(next, prev);
  }
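  // Copy the final wavefield back to the host and wait for the copy to
  // finish before the device memory is freed
  q.memcpy(ptr_prev, prev, sizeof(float)*grid_size).wait();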

  sycl::free(next - (16 - kHalfLength),q);
  sycl::free(prev - (16 - kHalfLength),q);
  sycl::free(vel - (16 - kHalfLength),q);
  sycl::free(coeff,q);  

}

int main(int argc, char* argv[]) {
  // Arrays used to update the wavefield
  float* prev;
  float* next;
  // Array to store wave velocity
  float* vel;

  // Variables to store size of grids and number of simulation iterations
  size_t n1, n2, n3;
  size_t n1_block, n2_block, n3_block;
  size_t num_iterations;

  // Flag to verify results with CPU version
  bool verify = false;

  // Seven arguments are read below (argv[1] to argv[7])
  if (argc < 8) {
    Usage(argv[0]);
    return 1;
  }

  try {
    // Parse command line arguments and increase them by HALO
    n1 = std::stoi(argv[1]) + (2 * kHalfLength);
    n2 = std::stoi(argv[2]) + (2 * kHalfLength);
    n3 = std::stoi(argv[3]) + (2 * kHalfLength);
    n1_block = std::stoi(argv[4]);
    n2_block = std::stoi(argv[5]);
    n3_block = std::stoi(argv[6]);
    num_iterations = std::stoi(argv[7]);    
  } catch (...) {
    Usage(argv[0]);
    return 1;
  }

  // Validate input sizes for the grid
  if (ValidateInput(n1, n2, n3, num_iterations)) {
    Usage(argv[0]);
    return 1;
  }

  // Create queue and print target info with default selector and in order
  // property
  queue q(default_selector_v, {property::queue::in_order()});
  std::cout << " Running nd_range GPU version\n";
  printTargetInfo(q);

  // Compute the total size of grid
  size_t nsize = n1 * n2 * n3;

  prev = new float[nsize];
  next = new float[nsize];
  vel = new float[nsize];

  // Compute coefficients to be used in wavefield update
  float coeff[kHalfLength + 1] = {-3.0548446,   +1.7777778,     -3.1111111e-1,
                                  +7.572087e-2, -1.76767677e-2, +3.480962e-3,
                                  -5.180005e-4, +5.074287e-5,   -2.42812e-6};

  // Apply the DX, DY and DZ to coefficients
  coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);
  for (auto i = 1; i <= kHalfLength; i++) {
    coeff[i] = coeff[i] / (dxyz * dxyz);
  }

  // Initialize arrays and introduce initial conditions (source)
  initialize(prev, next, vel, n1, n2, n3);

  auto start = std::chrono::steady_clock::now();

  // Invoke the driver function to perform 3D wave propagation offloaded to
  // the device
  iso3dfd(q, next, prev, vel, coeff, n1, n2, n3,n1_block,n2_block,n3_block, num_iterations);

  auto end = std::chrono::steady_clock::now();
  auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                  .count();
  printStats(time, n1, n2, n3, num_iterations);  

  delete[] prev;
  delete[] next;
  delete[] vel;

  return 0;
}

Once the application is built, we can run it from the command line with the following parameters:
./build/src/4_GPU_optimized 1024 1024 1024 32 8 4 100
<ul>
    <li>./build/src/4_GPU_optimized is the binary</li>
    <li>1024 1024 1024 are the sizes of the 3 dimensions; increasing them results in more computation time</li>
    <li>32 8 4 are the block sizes (n1_block, n2_block, n3_block) that define the local work group</li>
    <li>100 is the number of time steps</li>
</ul>
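To reason about a given set of block sizes before running the application, it can help to compute the launch shape they imply. The following is a minimal sketch under stated assumptions: `LaunchShape`, `CeilDiv` and `DescribeLaunch` are hypothetical names, the grid sizes are the interior sizes as typed on the command line (without the halo), and the formulas mirror how `local_nd_range` and `global_nd_range` are built in the listing above.

```cpp
#include <cstddef>

// Hypothetical summary of what one set of command-line parameters implies.
struct LaunchShape {
  std::size_t work_items_per_group;  // n1_block * n2_block * n3_block
  std::size_t work_groups;           // groups along z times groups along y
  std::size_t points_per_work_item;  // upper bound on ix-loop iterations
};

// Integer division rounded up; b must be non-zero.
inline std::size_t CeilDiv(std::size_t a, std::size_t b) {
  return (a + b - 1) / b;
}

// n1, n2, n3 are interior grid sizes. The global range has a single
// work-group along x, so only y and z contribute to the group count.
inline LaunchShape DescribeLaunch(std::size_t n1, std::size_t n2,
                                  std::size_t n3, std::size_t n1_block,
                                  std::size_t n2_block, std::size_t n3_block) {
  LaunchShape shape;
  shape.work_items_per_group = n1_block * n2_block * n3_block;
  shape.work_groups = CeilDiv(n3, n3_block) * CeilDiv(n2, n2_block);
  shape.points_per_work_item = CeilDiv(n1, n1_block);
  return shape;
}
```

With the parameters shown above, `DescribeLaunch(1024, 1024, 1024, 32, 8, 4)` yields 1024 work-items per group, 32768 work groups, and at most 32 points per work-item along x. The product of the three block sizes has to stay within the device limit, which can be queried with `sycl::info::device::max_work_group_size`.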

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_optimized.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_optimized.sh; else ./run_gpu_optimized.sh; fi

## ISO3DFD GPU Optimizations

* We started from a code version running with standard C++ on the CPU.
* Using Intel® Offload Advisor, we determined which loop was a good candidate for offload, and then used SYCL to make our code run on the GPU as well as on the CPU.
* We identified that the application was bound by integer operations.
* We fixed the indexing to reduce the number of INT operations.
* In this notebook, we check how well the L1 cache reuse implementation works.
* The next step is to run the Roofline Model and VTune to
    * Check the current optimizations using L1 cache reuse.

### Running the GPU Roofline Analysis
With the offload implemented in 4_GPU_optimized using SYCL, we'll want to run roofline analysis to see the improvements we made to the application and look for more areas where there is room for performance optimization.
```
advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myApplication 
```
The roofline analysis for the Iso3DFD GPU optimized code can be collected using:
```
advisor --collect=roofline --profile-gpu --project-dir=./../advisor/4_gpu -- ./build/src/4_GPU_optimized 1024 1024 1024 32 8 4 100
```

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_roofline_advisor.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_roofline_advisor.sh; else ./run_gpu_roofline_advisor.sh; fi

### Analyzing the HTML report


The roofline model below shows that:

* The application is now bound by HBM memory.
* The number of INT operations remains low.
* HBM traffic is high.
* Threading occupancy is higher.



<img src="img/4_iso.png">

### Roofline Analysis report overview
To display the report, just execute the following frame. In practice, the report will be available in the folder you defined as --out-dir in the previous script. 

[View the report in HTML](reports/advisor-report_linear.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/advisor-report.html', width=1024, height=768))


## Generating VTune reports
In the exercises below, we use the VTune™ analyzer to see what is going on with each implementation. The information shown is the high-level hotspot summary generated from the collection and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered as HTML pages. The VTune scripts below collect GPU offload and GPU hotspots information.

#### Learn more about VTune

Extensive training on VTune is available; click [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) for deep-dive training.

```
vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir -- ./build/src/3_GPU_linear 256 256 256 100
```

```
vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_dir_hotspots -- ./build/src/3_GPU_linear 256 256 256 100
```

```
vtune -report summary -result-dir vtune_dir -format html -report-output ./reports/output_offload.html
```

```
vtune -report summary -result-dir vtune_dir_hotspots -format html -report-output ./reports/output_hotspots.html
```

[View the report in HTML](reports/output_offload_linear.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/output_offload_linear.html', width=1024, height=768))


[View the report in HTML](reports/output_hotspots_linear.html)

In [None]:
from IPython.display import IFrame
display(IFrame(src='reports/output_hotspots_linear.html', width=1024, height=768))


### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu_linear_vtune.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu_linear_vtune.sh; else ./run_gpu_linear_vtune.sh; fi

## Summary
* We started from a code version running with standard C++ on the CPU.
* Using Intel® Offload Advisor, we determined which loop was a good candidate for offload.
* Using SYCL, we worked on a solution to make our code run on the GPU as well as on the CPU.
* In the first iteration, we identified that the application was bound by integer operations and fixed the indexing to reduce them.
* In the last step, we tuned the kernel by putting more work in each local work group, which optimizes loading neighboring stencil points from the fast L1 cache.
