# ISO3DFD and Offload Advisor Analysis

## Learning Objectives


<ul>
    <li>To run Offload Advisor and generate an HTML report</li>
    <li>To read and understand the metrics in the report</li>
    <li>To get a performance estimation of your application on the target hardware</li>
    <li>To decide which loops are good candidates for offload</li>
</ul>

## ISO3DFD Application basics

In this module, we assume that the developer already has code running on a CPU. At this stage, it doesn't matter whether the code is written in C/C++ or Fortran. Before porting code to a GPU, the developer should understand which parts of the code should be offloaded to the GPU. This step is not always trivial, because the developer needs to understand both the code and the hardware that will run the offloaded computations.
The goal of this activity is to show how Intel® Advisor can help decide which parts of the code should or should not be offloaded to the GPU.

Iso3DFD is a wave propagation kernel used in Oil and Gas applications. The resolution of the wave equation is based on finite differences which results in implementing a stencil in a 3D volume.

![3D Stencil](img/stencil_mount.png)

The general algorithm can be described as follows, using next and prev to store the pressure and vel to store the velocity: <br />

iterate over time steps<br />
|&emsp;    iterate over Z <br />
|&emsp;    |&emsp;    iterate over Y <br />
|&emsp;    |&emsp;    |&emsp;    iterate over X <br />
|&emsp;    |&emsp;    |&emsp;    |&emsp;    tmp = compute stencil for prev[x,y,z] <br />
|&emsp;    |&emsp;    |&emsp;    |&emsp;    next[x,y,z] = update(tmp, prev[x,y,z], next[x,y,z], vel[x,y,z]) <br />
|&emsp;    swap(prev, next) <br />
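
The loop structure above can be sketched in C++ on a much smaller problem. The following is a minimal 1D analogue, not the iso3dfd kernel itself: `Propagate1D` is a hypothetical name, the stencil has a half length of 1 instead of the wider 3D stencil, and `vel` is assumed to hold a precomputed per-cell factor.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical 1D analogue of the time-stepping loop nest.
// prev and next hold the wavefield, vel holds a per-cell factor.
std::vector<float> Propagate1D(std::vector<float> prev, std::vector<float> next,
                               const std::vector<float>& vel,
                               std::size_t time_steps) {
  const std::size_t n = prev.size();
  for (std::size_t t = 0; t < time_steps; ++t) {
    // The first and last cells act as the halo and are never updated.
    for (std::size_t x = 1; x + 1 < n; ++x) {
      // Compute the stencil from the neighbors of prev[x].
      const float tmp = prev[x - 1] - 2.0f * prev[x] + prev[x + 1];
      // Update next[x] from the stencil value, prev, next and vel.
      next[x] = 2.0f * prev[x] - next[x] + tmp * vel[x];
    }
    // After the swap, prev always holds the most recent wavefield.
    std::swap(prev, next);
  }
  return prev;
}
```

Starting from a single perturbation `{0, 0, 1, 0, 0}` with `next` set to zero and `vel` set to 0.5 everywhere, one time step spreads the perturbation to the neighboring cells and yields `{0, 0.5, 1, 0.5, 0}`.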

If we try to extract a 2D cut of the volume at different time steps, we can see a perturbation evolving and reflecting on the edges.
<table style="text-align:left">
    <tr>
        <th><img src='img/prop2.png' alt='Propagation at T10'/></th>
        <th><img src='img/prop3.png' alt='Propagation at T20'/></th>
        <th><img src='img/prop4.png' alt='Propagation at T30'/></th>
        <th><img src='img/prop5.png' alt='Propagation at T40'/></th>
   </tr>
   <tr>
        <th style="text-align:center">Propagation at t10</th>
        <th style="text-align:center">Propagation at t20</th>
        <th style="text-align:center">Propagation at t30</th>
        <th style="text-align:center">Propagation at t40</th>
  </tr>
</table>


## Compiling and running iso3DFD 

The first step is to compile and run the application. The steps below show how to optimize iso3dfd: we start with code that runs on the CPU, then build a basic implementation of GPU offload, then iterate several times to optimize the code. Throughout, the Intel® Advisor analysis tool provides performance analysis of the built applications.



## Offloading modeling
The first step is to run offload modeling on the CPU only version of the application (1_CPU_only) to identify code regions that are good opportunities for GPU offload. Running accurate modeling can take considerable time as Intel® Advisor performs analysis on your project. There are two commands provided below. The first is fast, but less accurate and should only be used as a proof of concept. The second will give considerably more helpful and accurate profile information. Depending on your system, modeling may take well over an hour.

The C++ code below is the CPU-only version of the application. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile src/1_CPU_only.cpp
//==============================================================
// Copyright   2022 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <utility>

#include "Utils.hpp"

void inline iso3dfdIteration(float* ptr_next_base, float* ptr_prev_base,
                             float* ptr_vel_base, float* coeff, const size_t n1,
                             const size_t n2, const size_t n3) {
  auto dimn1n2 = n1 * n2;

  // Remove HALO from the end
  auto n3_end = n3 - kHalfLength;
  auto n2_end = n2 - kHalfLength;
  auto n1_end = n1 - kHalfLength;

  for (auto iz = kHalfLength; iz < n3_end; iz++) {
    for (auto iy = kHalfLength; iy < n2_end; iy++) {
      // Calculate start pointers for the row over X dimension
      float* ptr_next = ptr_next_base + iz * dimn1n2 + iy * n1;
      float* ptr_prev = ptr_prev_base + iz * dimn1n2 + iy * n1;
      float* ptr_vel = ptr_vel_base + iz * dimn1n2 + iy * n1;

      // Iterate over X
      for (auto ix = kHalfLength; ix < n1_end; ix++) {
        // Calculate values for each cell
        float value = ptr_prev[ix] * coeff[0];
        for (int i = 1; i <= kHalfLength; i++) {
          value +=
              coeff[i] *
               (ptr_prev[ix + i] + ptr_prev[ix - i] +
                ptr_prev[ix + i * n1] + ptr_prev[ix - i * n1] +
                ptr_prev[ix + i * dimn1n2] + ptr_prev[ix - i * dimn1n2]);
        }
        ptr_next[ix] = 2.0f * ptr_prev[ix] - ptr_next[ix] + value * ptr_vel[ix];
      }
    }
  }
}

void iso3dfd(float* next, float* prev, float* vel, float* coeff,
             const size_t n1, const size_t n2, const size_t n3,
             const size_t nreps) {
  for (size_t it = 0; it < nreps; it++) {
    iso3dfdIteration(next, prev, vel, coeff, n1, n2, n3);
    // Swap the pointers for always having current values in prev array
    std::swap(next, prev);
  }
}

int main(int argc, char* argv[]) {
  // Arrays used to update the wavefield
  float* prev;
  float* next;
  // Array to store wave velocity
  float* vel;

  // Variables to store size of grids and number of simulation iterations
  size_t n1, n2, n3;
  size_t num_iterations;

  if (argc < 5) {
    Usage(argv[0]);
    return 1;
  }

  try {
    // Parse command line arguments and increase them by HALO
    n1 = std::stoi(argv[1]) + (2 * kHalfLength);
    n2 = std::stoi(argv[2]) + (2 * kHalfLength);
    n3 = std::stoi(argv[3]) + (2 * kHalfLength);
    num_iterations = std::stoi(argv[4]);
  } catch (...) {
    Usage(argv[0]);
    return 1;
  }

  // Validate input sizes for the grid
  if (ValidateInput(n1, n2, n3, num_iterations)) {
    Usage(argv[0]);
    return 1;
  }

  // Compute the total size of grid
  size_t nsize = n1 * n2 * n3;

  prev = new float[nsize];
  next = new float[nsize];
  vel = new float[nsize];

  // Compute coefficients to be used in wavefield update
  float coeff[kHalfLength + 1] = {-3.0548446,   +1.7777778,     -3.1111111e-1,
                                  +7.572087e-2, -1.76767677e-2, +3.480962e-3,
                                  -5.180005e-4, +5.074287e-5,   -2.42812e-6};

  // Apply the DX, DY and DZ to coefficients
  coeff[0] = (3.0f * coeff[0]) / (dxyz * dxyz);
  for (auto i = 1; i <= kHalfLength; i++) {
    coeff[i] = coeff[i] / (dxyz * dxyz);
  }

  // Initialize arrays and introduce initial conditions (source)
  initialize(prev, next, vel, n1, n2, n3);

  std::cout << "Running on CPU serial version\n";
  auto start = std::chrono::steady_clock::now();

  // Invoke the driver function to perform 3D wave propagation 1 thread serial
  // version
  iso3dfd(next, prev, vel, coeff, n1, n2, n3, num_iterations);

  auto end = std::chrono::steady_clock::now();
  auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                  .count();

  printStats(time, n1, n2, n3, num_iterations);

  delete[] prev;
  delete[] next;
  delete[] vel;

  return 0;
}

Once the application is built, we can run it from the command line with a few parameters, as follows:
src/1_CPU_only 256 256 256 100
<ul>
    <li>src/1_CPU_only is the binary</li>
    <li>256 256 256 are the sizes of the 3 dimensions; increasing them results in more computation time</li>
    <li>100 is the number of time steps</li>
</ul>
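
To get a feel for how the grid sizes translate into memory, the following sketch computes the allocation made by `main` for the `prev`, `next` and `vel` arrays. It assumes `kHalfLength` is 8, which is inferred from the nine coefficients in the listing; the real value is defined in `Utils.hpp`. The names `PaddedDim` and `GridBytes` are hypothetical helpers, not part of the sample.

```cpp
#include <cstddef>

// Assumption: kHalfLength is 8 (inferred from the nine stencil coefficients).
constexpr std::size_t kAssumedHalfLength = 8;

// Size of one dimension after the halo is added on both sides.
constexpr std::size_t PaddedDim(std::size_t n) {
  return n + 2 * kAssumedHalfLength;
}

// Bytes allocated for the prev, next and vel arrays together.
constexpr std::size_t GridBytes(std::size_t n1, std::size_t n2,
                                std::size_t n3) {
  return 3 * sizeof(float) * PaddedDim(n1) * PaddedDim(n2) * PaddedDim(n3);
}
```

Under these assumptions and with 4-byte floats, a 256 x 256 x 256 grid is padded to 272 cells per dimension, and `GridBytes(256, 256, 256)` is 241,483,776 bytes, roughly 241 MB.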

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_cpu_only.sh;if [ -x "$(command -v qsub)" ]; then ./q run_cpu_only.sh; else ./run_cpu_only.sh; fi

Now that you have been able to compile and execute the code, let's start profiling to find out what should be offloaded!

## Running Offload Advisor

The current code runs on a CPU and is not even threaded. For Intel® Offload Advisor, it doesn't matter whether your code is already threaded. Advisor runs several analyses on your application to extract metrics such as the number of operations, the number of memory transfers, data dependencies, and many more.
We are going to detail each of these steps. Remember that our goal here is to decide whether some of our loops are good candidates for offload. In this section, we will generate the report assuming that we want to offload our computations to a GPU on Intel® DevCloud.
Keep in mind that if you want Advisor to extract as much information as possible, you need to compile your application with debug information (-g with Intel compilers).
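
To make the "number of operations" and "memory transfers" metrics more concrete, here is a rough hand count for the stencil in the CPU listing. This is a back-of-the-envelope sketch, not what Advisor reports: the function names are hypothetical, the memory figure assumes every neighbor read of `prev` is served from cache, and measured counts can differ (for example when the compiler fuses multiplies and adds).

```cpp
#include <cstddef>

// Floating-point operations per updated cell, counted from the CPU listing:
//   1 multiply for ptr_prev[ix] * coeff[0]
//   7 per stencil step (5 adds inside the parentheses, 1 multiply, 1 add)
//   4 for the final update (2 multiplies, 1 subtract, 1 add)
constexpr int FlopsPerCell(int half_length) { return 1 + 7 * half_length + 4; }

// Best-case memory traffic per cell (assumption: perfect cache reuse):
// read prev, next and vel once, write next once.
constexpr std::size_t MinBytesPerCell() { return 4 * sizeof(float); }

// Operations per byte; a higher value means the loop is more compute bound.
constexpr double ArithmeticIntensity(int half_length) {
  return static_cast<double>(FlopsPerCell(half_length)) / MinBytesPerCell();
}
```

Assuming a half length of 8 and 4-byte floats, this gives 61 operations and 16 bytes per cell, or about 3.8 operations per byte in the best case. Advisor measures these quantities on the real run rather than estimating them.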

Run the following command from the "build" directory:
```
advisor --collect=offload --config=pvc_xt_448xve --project-dir=./../advisor/1_cpu -- ./build/src/1_CPU_only 256 256 256 20
```

### Simple method: Use Collection Presets
For the Offload Modeling perspective, Intel® Advisor has a special collection mode, --collect=offload, that allows you to run several analyses using only one Intel® Advisor CLI command. When you run the collection, it sequentially runs the data collection and performance modeling steps.
In the commands below, make sure to replace myApplication with your application's executable path and name before executing a command. If your application requires additional command line options, add them after the executable name.
```
advisor --collect=offload --project-dir=./advi_results -- ./myApplication 
```
The iso3DFD CPU code can be run using:
```
advisor --collect=offload --config=pvc_xt_448xve --project-dir=./../advisor/1_cpu -- ./build/src/1_CPU_only 256 256 256 20
```

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_offload_advisor.sh;if [ -x "$(command -v qsub)" ]; then ./q run_offload_advisor.sh; else ./run_offload_advisor.sh; fi

## Second Method to run the Offload Advisor

### Running the Survey

The Survey is usually the first analysis you want to run with Intel® Advisor. It is mainly used to time your application as well as its individual loops and functions. There is a minimal performance penalty at this stage. This analysis also extracts information embedded by the compiler in your binary. This information is mainly related to vectorization (why a loop was or was not vectorized, vectorization efficiency, etc.).

```
advisor --collect=survey --auto-finalize --static-instruction-mix -- ./build/src/1_CPU_only 128 128 128 20
```

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_advisor_survey.sh;if [ -x "$(command -v qsub)" ]; then ./q run_advisor_survey.sh; else ./run_advisor_survey.sh; fi

### Running the trip count and cache simulation 
The second step in deciding what should be offloaded is to run the trip count analysis together with the cache simulation. This step uses instrumentation to count how many iterations run in each loop. Adding the --flop option also provides the precise number of operations executed in each of your code sections.

In this step, we also ask Advisor to run a cache simulation, specifying the memory configuration of the hardware we are targeting for offload.

Be aware that this step takes much longer than simply running your application. You can expect a slowdown of roughly 10x due to the many parameters Advisor extracts during the run.
```
advisor --collect=tripcounts --flop --auto-finalize --target-device=gen9_gt2 -- ./build/src/1_CPU_only 128 128 128 20
```
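
To illustrate what a trip count is, the sketch below walks the same loop nest as `iso3dfdIteration` for one time step and only counts iterations. It is a hand-written illustration, not how Advisor instruments the binary; `TripCounts` and `CountTrips` are hypothetical names, and the sizes passed in are the padded sizes (grid size plus halo).

```cpp
#include <cstddef>

struct TripCounts {
  std::size_t z = 0;        // iterations of the iz loop
  std::size_t y = 0;        // iterations of the iy loop, summed over iz
  std::size_t x = 0;        // iterations of the ix loop, summed over iz and iy
  std::size_t stencil = 0;  // iterations of the innermost coefficient loop
};

// Counts loop iterations for one time step of the stencil loop nest.
// n1, n2 and n3 are padded sizes; half_length is the halo width.
TripCounts CountTrips(std::size_t n1, std::size_t n2, std::size_t n3,
                      std::size_t half_length) {
  TripCounts c;
  for (std::size_t iz = half_length; iz + half_length < n3; ++iz) {
    ++c.z;
    for (std::size_t iy = half_length; iy + half_length < n2; ++iy) {
      ++c.y;
      for (std::size_t ix = half_length; ix + half_length < n1; ++ix) {
        ++c.x;
        for (std::size_t i = 1; i <= half_length; ++i) ++c.stencil;
      }
    }
  }
  return c;
}
```

For the `128 128 128` run above, and assuming a half length of 8, each time step executes 128 iterations of the Z loop, 2,097,152 iterations of the X loop in total, and 16,777,216 iterations of the innermost loop.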

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_advisor_tripcounts.sh;if [ -x "$(command -v qsub)" ]; then ./q run_advisor_tripcounts.sh; else ./run_advisor_tripcounts.sh; fi

### Optional: Dependency analysis

Forcing threading in locations where it is not supposed to happen can be dangerous and can change the results of the computation. To avoid parallelizing loops that cannot be parallelized, you can run an additional analysis called the dependency analysis. This step was initially used to help users implement vectorization, but Offload Advisor can also use it to recommend what can or cannot be offloaded.

```
advisor --collect=dependencies --loop-call-count-limit=16 --select markup=gpu_generic --filter-reductions --project-dir=./advi_results -- ./myApplication
```
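
The kind of problem the dependency analysis looks for can be shown with two small loops. This is an illustrative sketch with hypothetical function names; running the iterations in reverse order stands in for the reordering that parallel execution may cause.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: each element depends only on itself,
// so any execution order gives the same result.
std::vector<int> ScaleInOrder(std::vector<int> v, bool reversed) {
  const std::size_t n = v.size();
  for (std::size_t k = 0; k < n; ++k) {
    const std::size_t i = reversed ? n - 1 - k : k;
    v[i] = 2 * v[i];
  }
  return v;
}

// Loop-carried dependency: iteration i reads the value written by
// iteration i - 1, so changing the order changes the result.
std::vector<int> PrefixSumInOrder(std::vector<int> v, bool reversed) {
  const std::size_t n = v.size();
  for (std::size_t k = 1; k < n; ++k) {
    const std::size_t i = reversed ? n - k : k;
    v[i] += v[i - 1];
  }
  return v;
}
```

With the input `{1, 2, 3, 4}`, `ScaleInOrder` returns `{2, 4, 6, 8}` in both orders, while `PrefixSumInOrder` returns `{1, 3, 6, 10}` in the original order and `{1, 3, 5, 7}` when reversed. A loop of the second kind is not safe to offload as is.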

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_cpu_only.sh;if [ -x "$(command -v qsub)" ]; then ./q run_cpu_only.sh; else ./run_cpu_only.sh; fi

### Analyzing the HTML report

We finally reached the last step and only need to generate our HTML report for offloading on GPU. This report will show us:
<ul>
    <li>What is the expected speedup on the target device</li>
    <li>What will most likely be our bottleneck on the target device</li>
    <li>What are the good candidates for offload</li>
    <li>What are the loops that should not be offloaded</li>
</ul>

### Advisor report overview
To display the report, just execute the following cell. In practice, the report will be available in the folder you defined as --out-dir in the previous script.

[View the report in HTML](reports/advisor_report_overview.html)


In [None]:
from IPython.display import IFrame

IFrame(src='reports/advisor_report_overview.html', width=900, height=600)

### Advisor report
To display the report, just execute the following cell. In practice, the report will be available in the folder you defined as --out-dir in the previous script.

[View the report in HTML](reports/report.html)

In [None]:
from IPython.display import IFrame

IFrame(src='reports/report.html', width=900, height=600)

Navigate through the report and try to understand what speedup to expect, what should be offloaded, and what should not be offloaded. Also navigate to the "Offloaded Regions" tab to see exactly which parts of the code should run on the GPU.

### How to remember these complex command lines?

You might think that the command lines we used are too complex to remember, and you are right! This is why Advisor provides an option called --dry-run that prints all the individual commands you need to run this analysis from scratch.

Generate pre-configured command lines with --collect=offload and the --dry-run option.
The option generates:
* Commands for the Intel Advisor CLI collection workflow
* Commands that correspond to a specified accuracy level

```
advisor --collect=offload --accuracy=low --dry-run --project-dir=./advi_results -- ./myApplication
```

```
advisor --collect=offload --accuracy=low --dry-run -- ./build/src/1_CPU_only 128 128 128 20
```
--config can use the following devices:
<ul>
    <li>pvc_xt_448xve</li>
    <li>xehpg_512xve</li>
    <li>xehpg_256xve</li>
    <li>gen12_tgl</li>
    <li>gen12_dg1</li>
    <li>gen11_icl</li>
    <li>gen11_gt2</li>
    <li>gen9_gt2</li>
    <li>gen9_gt3</li>
    <li>gen9_gt4</li>
</ul>

In [None]:
! chmod 755 q; chmod 755 run_dry_run_advisor.sh; if [ -x "$(command -v qsub)" ]; then ./q run_dry_run_advisor.sh; else ./run_dry_run_advisor.sh; fi

## Summary
### Next Iteration of implementing the parallelism using SYCL
In this module, we:

* Started with serial C++ code that runs on the CPU.
* Used the Intel® Advisor analysis tool to provide performance analysis/projections of the application.
* Ran offload modeling on the CPU version of the application to identify code regions that are good opportunities for GPU offload.
* Reviewed the Offload report, so we are now ready to build an implementation of GPU offload using SYCL.

In the next iterations, we will also refine the SYCL code several times to optimize it for GPUs.