# Analysis and Optimizing for Performance Portability

In the previous four modules, we introduced a basic GEMM and then used SYCL to improve the algorithm, running the exact same code across four different platforms.

In this section, we summarize and analyze the results of the progressively improved GEMM implementations in comparison to oneMKL. We focus on larger matrix sizes because they provide enough operations to show differences between the implementations. This summary data makes it easier to compare the algorithms across the platforms.

We will explore the impacts of work-group size and implement common code that parameterizes the size at run time for our set of algorithms.  

### Learning objectives

- Review results to determine the effectiveness of the algorithm implementations across platforms.
- Articulate how to determine the optimal work-group size based on the algorithm.
- Use SYCL to query the maximum work-group size and the maximum number of compute units.
- Recognize the tradeoffs of using a library versus your own implementation.

### Execution Time Analysis

Looking at the execution times across the platforms gives an idea of overall algorithm performance as well as platform and accelerator capability. The kernels were executed on two different Intel® Xeon® processors, an Intel® Gen9 integrated GPU, and an Intel® Iris® Xe MAX discrete GPU.

The oneMKL library gives the best overall performance across all platforms. Among the implementations that do not use a library, the local memory implementation of GEMM shows the greatest improvement, and in some cases it outperforms oneMKL.

The graph below shows execution times across the different platforms and algorithm implementations. The matrix size used for collecting the GEMM execution times was __20480 x 20480__, and the work-group size in the algorithm implementations was __16x16__.

<img src=Assets/ppp_perf.PNG>
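
To compare execution times across matrix sizes, it can help to convert them to throughput. The following is a minimal sketch, assuming the usual operation count of about 2N³ for a square N x N GEMM (N multiplies and N adds per output element). The function names `gemm_flops` and `gemm_gflops` are illustrative and are not part of the module's source code.

```cpp
#include <cassert>
#include <cstdint>

// Approximate floating-point operation count for C = A * B with square
// n x n matrices: each of the n*n outputs needs n multiplies and n adds.
std::uint64_t gemm_flops(std::uint64_t n) {
    assert(n > 0);
    return 2 * n * n * n;
}

// Converts a measured kernel time into GFLOP/s. The elapsed time must come
// from your own measurement; no timing values are assumed here.
double gemm_gflops(std::uint64_t n, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return static_cast<double>(gemm_flops(n)) / elapsed_seconds / 1e9;
}
```

For the 20480 x 20480 case, `gemm_flops(20480)` is roughly 1.7 x 10¹³ operations, which is why larger matrices expose differences between implementations more clearly.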


### Platform and Accelerator Capability

It is important to consider platform and accelerator capabilities when implementing an algorithm. For example, the implementations so far used a work-group size of 16x16 (= 256 work-items), which works on all the platforms used to measure execution times. However, different accelerators differ in the maximum work-group size and local memory size they support.

Below are details of the CPUs and GPUs used to execute the kernels, with the corresponding characteristics obtained using `clinfo`.

| | Intel® Gen9 GPU | Intel® Iris Xe Max GPU | Intel® Xeon® Gold 6128 Processor | Intel® Xeon® Platinum 8153 Processor |
|---|---|---|---|---|
|Device type | Integrated GPU| Discrete GPU | CPU | CPU |
|Number of Compute Units | 24 (EU) | 96 (EU) | 12 (Cores) | 64 (Cores) |
|Local Memory Size | 64 KB | 64 KB | 32 KB | 32 KB |
|Max Work-Group Size | 256 | 512 | 8192 | 8192 |
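
The two limits in the table can be checked together before choosing a work-group size. The sketch below is illustrative only: it assumes a tiled kernel that keeps one tile of A and one tile of B in local memory, each the size of the work-group. That layout is an assumption and may not match the module's local memory kernel. The names `tile_local_mem_bytes` and `fits_device` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of local memory needed by one work-group, ASSUMING the kernel keeps
// one wg_dim x wg_dim tile of A and one of B in local memory (hypothetical layout).
std::size_t tile_local_mem_bytes(std::size_t wg_dim, std::size_t elem_size) {
    assert(wg_dim > 0 && elem_size > 0);
    return 2 * wg_dim * wg_dim * elem_size;
}

// True when a wg_dim x wg_dim work-group respects both device limits:
// the maximum work-group size and the local memory size (in bytes).
bool fits_device(std::size_t wg_dim, std::size_t elem_size,
                 std::size_t max_work_group_size, std::size_t local_mem_bytes) {
    return wg_dim * wg_dim <= max_work_group_size &&
           tile_local_mem_bytes(wg_dim, elem_size) <= local_mem_bytes;
}
```

Under these assumptions, a 16x16 work-group of 4-byte floats needs 2 x 16 x 16 x 4 = 2048 bytes of local memory, well within the limits of every device in the table.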

### Query Platforms

So far we have used a work-group size of 16x16. While that is not the only valid size, it is often a good place to start for portability across Intel® devices. In our previous implementations we used the code below to report the device name and query the maximum work-group size. The device name confirmed that we were executing on the intended accelerator; the maximum work-group size was purely informational.

```cpp
    std::cout << "Offload Device        : " << q.get_device().get_info<info::device::name>() << "\n";
    std::cout << "max_work_group_size   : " << q.get_device().get_info<info::device::max_work_group_size>() << "\n";
```

The resulting output is shown below. In the run scripts, we pass the matrix size and the work-group size manually; each can be specified with the -m and -n switches, and those values are then used instead. This is not ideal when running across multiple platforms with multiple types of accelerators.

```bash
Offload Device        : Intel(R) UHD Graphics P630 [0x3e96]
max_work_group_size   : 256
Configuration         : MATRIX_SIZE= 5120x5120 | WORK_GROUP_SIZE= 16x16
```

The four devices we used for computation have different maximum work-group sizes (256, 512, and 8192). With this information, we can choose a work-group size for our algorithm that matches each device's capability.

### Algorithm Consideration

Our algorithm is a two dimensional general matrix multiply (GEMM) algorithm as shown below.

<img src=Assets/naive.PNG>

Because the work-groups are two-dimensional, a good starting point for a work-group size that maximizes device use is the square root of the maximum work-group size, which gives the largest square work-group the device supports.

In the case of a Gen9 GPU, 256 is the max_work_group_size. This works out nicely because the square root of 256 is 16, and the matrix size of 5120 also divides equally by 16. A work-group size of 16x16 is a good candidate.

In the case of an Intel® Iris® Xe MAX GPU, 512 is the max_work_group_size. The square root of 512 is approximately 22.6. Using 22 would give an invalid work-group size, since it does not divide the matrix size of 5120 evenly. We need to find the largest work-group size that also divides the matrix size evenly. The valid work-group sizes would be: 20x20, 16x16, 10x10, 8x8, and so on. The maximum does not always give the best result; we have to try different work-group sizes to determine which performs best.

In the case of CPUs, 8192 is the max_work_group_size. The valid work-group sizes would be: 80x80, 64x64, 40x40, 32x32, 20x20, 16x16, and so on.
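
The reasoning above can be sketched as a small, device-independent helper: take the integer square root of the maximum work-group size and keep every dimension that divides the matrix size evenly. The function name `valid_square_dims` is illustrative and does not appear in the module's source. Unlike the module's common code, this sketch also reports odd dimensions such as 5.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns every dimension d (largest first) such that a d x d work-group fits
// within max_work_group_size and d divides the matrix dimension n evenly.
std::vector<std::size_t> valid_square_dims(std::size_t max_work_group_size,
                                           std::size_t n) {
    assert(max_work_group_size > 0 && n > 0);

    // Integer floor(sqrt(max_work_group_size)) without floating point.
    std::size_t d = 1;
    while ((d + 1) * (d + 1) <= max_work_group_size) ++d;

    std::vector<std::size_t> dims;
    for (; d >= 1; --d) {
        if (n % d == 0) dims.push_back(d);
    }
    return dims;
}
```

For a matrix size of 5120, calling this with 256, 512, and 8192 yields lists that start with 16, 20, and 80 respectively, matching the sizes listed above.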


### Impact of Work-group Sizes across different devices

As you can see, multiple work-group sizes are possible, as long as they fit within the maximum work-group size and divide our two-dimensional matrix evenly. The following graphs explore the local memory implementation across all valid work-group sizes for all devices. oneAPI Math Kernel Library (oneMKL) makes its own determination of the optimal work-group size. If you get errors during experimentation, the likely cause is an invalid work-group size. It is also possible for a work-group size to run correctly but, because it is not a multiple of 16, cause register spills that negatively impact performance.

<img src=Assets/ppp_wg_gpu.PNG>
<img src=Assets/ppp_wg_cpu.PNG>
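
When experimenting with manually chosen sizes, a simple check before launching the kernel can explain why a size is rejected. This is a minimal sketch covering only the two conditions discussed in this section (the device limit and even division of the matrix size); the function name `work_group_problem` and its messages are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns an empty string when a wg_dim x wg_dim work-group is usable for an
// n x n matrix on a device with the given limit; otherwise a short reason.
std::string work_group_problem(std::size_t wg_dim, std::size_t n,
                               std::size_t max_work_group_size) {
    assert(n > 0);
    if (wg_dim == 0) return "work-group dimension must be positive";
    if (wg_dim * wg_dim > max_work_group_size)
        return "work-group exceeds the device's max_work_group_size";
    if (n % wg_dim != 0)
        return "work-group dimension does not divide the matrix size evenly";
    return "";
}
```

For example, a dimension of 22 fits within a limit of 512 (22 x 22 = 484) but is still reported as a problem for a matrix size of 5120, because 5120 is not divisible by 22.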


### Optimal Work-Group size for Performance Portability

In the following code, we query the maximum work-group size supported by the device and then compute all the valid work-group sizes that can be used in the algorithm. It took experimentation to determine which values worked and produced the best results. Based on this experimentation, we settled on an approach for choosing an optimal work-group size for a device.

This new common code works for all platforms and does not change the algorithm implementations, only the size of the work-groups used. It also prints more detailed platform query output, shown after the code. All of the reported work-group sizes are valid, but not all produce good results, which is where experimentation needs to take place. The charts in this section illustrate how performance varies with work-group size.

```cpp
    
    // Find valid work-group sizes to try for performance.
    // N is the matrix dimension; M holds the work-group dimension
    // (0 appears to mean that none was requested on the command line).
    std::vector<int> work_group_sizes;
    auto max_work_group_size = q.get_device().get_info<info::device::max_work_group_size>();
    // Start from floor(sqrt(max)) rounded down to an even value, then step down
    // by 2, keeping every dimension that divides N evenly (largest first).
    int work_group_dim_size = static_cast<int>(sqrt(static_cast<double>(max_work_group_size)));
    work_group_dim_size = work_group_dim_size - work_group_dim_size % 2;
    while (work_group_dim_size >= 2) {
        if (N % work_group_dim_size == 0) work_group_sizes.push_back(work_group_dim_size);
        work_group_dim_size = work_group_dim_size - 2;
    }
    std::cout << "valid_wg_sizes        : ";
    for (size_t i = 0; i < work_group_sizes.size(); i++) std::cout << work_group_sizes[i] << "x" << work_group_sizes[i] << " ";
    std::cout << "\n";

    // Find the optimal work-group size for the offload device. Each later loop
    // overrides the earlier result, so the final choice is the largest multiple
    // of 32 if one exists, otherwise the largest multiple of 16, otherwise of 8.
    int optimal_work_group_dim_size = 0;
    for (size_t i = 0; i < work_group_sizes.size(); i++) {
        if (work_group_sizes[i] % 8 == 0) { optimal_work_group_dim_size = work_group_sizes[i]; break; }
    }
    for (size_t i = 0; i < work_group_sizes.size(); i++) {
        if (work_group_sizes[i] % 16 == 0) { optimal_work_group_dim_size = work_group_sizes[i]; break; }
    }
    for (size_t i = 0; i < work_group_sizes.size(); i++) {
        if (work_group_sizes[i] % 32 == 0) { optimal_work_group_dim_size = work_group_sizes[i]; break; }
    }
    std::cout << "optimal_wg_size       : " << optimal_work_group_dim_size << "x" << optimal_work_group_dim_size << "\n";
    // Use the computed size only when M has not already been set.
    if (M == 0) M = optimal_work_group_dim_size;
```
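
The selection step can also be isolated from the device query so it is easy to reason about on its own. The sketch below assumes the candidate list is sorted largest first, as the loop above produces it, and applies the same preference order (multiples of 32, then 16, then 8). The function name `pick_work_group_dim` is illustrative and is not part of the module's source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Picks a work-group dimension from candidates sorted largest first,
// preferring the largest multiple of 32, then of 16, then of 8.
// Returns 0 when no candidate is a multiple of 8.
std::size_t pick_work_group_dim(const std::vector<std::size_t>& candidates) {
    assert(std::is_sorted(candidates.begin(), candidates.end(),
                          std::greater<std::size_t>()));
    const std::size_t preferred_multiples[] = {32, 16, 8};
    for (std::size_t multiple : preferred_multiples) {
        for (std::size_t dim : candidates) {
            if (dim != 0 && dim % multiple == 0) return dim;
        }
    }
    return 0;
}
```

With the even candidates for a limit of 8192 and a matrix size of 5120 (80, 64, 40, 32, ...), this picks 64 rather than 80, because 64 is the largest multiple of 32. A result of 0 signals that no preferred size exists and a fallback is needed.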

Using the above code to determine the optimal work-group size for the offload device, we can re-measure the execution times for each algorithm implementation. The output now prints all the valid work-group sizes and the optimal work-group size that will be used for the device.

<img src=Assets/optimal_wg_size.PNG>

#### Compile and Run with work-group optimization

The new common code source file is [mm_dpcpp_common_wg.cpp](lab/mm_dpcpp_common_wg.cpp). We use it to compile the different kernel implementations.

1. Run the cell in the __Select Offload Device__ section to choose a target device to run the code on.
2. Next, run the cell in the __Build and Run__ section to compile and execute the local memory implementation of the code with work-group optimization.

#### Select Offload Device

In [None]:
run accelerator.py

#### Build and Run
Select the cell below and click __Run__ ▶ to compile and execute the code on the selected device:

In [None]:
! chmod 755 q; chmod 755 run_mm_localmem_wg.sh; if [ -x "$(command -v qsub)" ]; then ./q run_mm_localmem_wg.sh "{device.value}"; else ./run_mm_localmem_wg.sh; fi

### Performance Portability Analysis

Using the above code to determine the optimal work-group size for the offload device, we measured the execution times for the various implementations.

Focusing only on the valid work-group sizes, you can see that for a CPU, in our case, a work-group size of 64x64 always yields a better result. With a work-group size of 64x64, the local memory implementation is able to outperform the library. These graphs are for a matrix size of 20480x20480.

<img src=Assets/ppp_perfopt.PNG>


We can also compare performance across different matrix sizes. As the matrix size increases, the local memory optimized algorithm and the oneMKL algorithm perform better, but the benefit is much smaller when the matrix size is small.

<img src=Assets/ppp_all.png>

### Using oneAPI GPU Optimization Guide

By using only local memory optimization and work-group size tuning, we were able to see significant performance improvements, and on some hardware configurations better performance than oneMKL.

Many more optimizations are possible by following the __[oneAPI GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/intro.html)__.

#### oneAPI GPU Optimization Guide

The guide linked above covers topics related to the coding, submission, and execution of kernels:
- Reduction
- Sub-groups
- Avoiding Register Spills
- Shared Local Memory
- Removing Conditional Checks
- Kernel Launch
- Using Libraries for Accelerator Offload
- Using Standard Library Functions in SYCL Kernels
- Efficiently Implementing Fourier Correlation Using oneAPI Math Kernel Library (oneMKL)
- Executing Multiple Kernels on the Device at the Same Time
- Synchronization among Threads in a Kernel
- Restrict Directive
- Submitting Kernels to Multiple Queues
- Avoid Redundant Queue Construction
- Considerations for selecting work-group size

### Summary

In this section, we explored how to query the platforms for more detailed information to make better choices with respect to work-group size. In addition, we discussed why it's important to understand the algorithm that you are using and how it impacts the choice of work-group size. Finally, we demonstrated the impact of this parameterization on performance, which yielded even more speedup and ultimately rivaled oneMKL performance in this scenario.
 
It should be noted that by writing our own parameterization scheme we had to create a lot more lines of code than would be required if just using the oneMKL library.

This path using oneAPI and SYCL provides a methodology for deciding when to use a library and, if one is not available, how to use SYCL and oneAPI to write heterogeneous code.

In the next section we expand on our VTune™ analysis.

- Notices
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. *Other names and brands may be claimed as the property of others.
