# Math Kernel Library (oneMKL) and SYCL Basic Parallel Kernel

In this set of modules, we explore how to use oneAPI and SYCL to implement matrix multiplication. We start with the Intel® oneAPI Math Kernel Library (oneMKL), then implement matrix multiplication in its most basic parallel form, and finally improve performance by tuning the kernel code while trying to maintain performance portability. All code improvements are measured relative to the performance of oneMKL.

### Learning Objectives
- Gain familiarity with oneMKL and be able to use it for a two-dimensional GEMM.
- Use a basic GEMM application as the basis for enhancements.
- Interpret roofline and VTune™ analyzer results to measure the performance of the GEMM applications.


## Intel oneAPI Math Kernel Library (oneMKL)

One of the best ways to achieve performance-portable code is to take advantage of a library. In this case, oneMKL offers a compelling GEMM implementation that we will use as our baseline. If a suitable library exists, using it should always be one's first attempt at achieving performant, portable code. All other implementations will be measured against oneMKL.

Intel oneMKL is included in the Intel oneAPI toolkits and there is extensive documentation at [Get Started with Intel oneAPI Math Kernel Library.](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-mkl-for-dpcpp/top.html)

The Intel® oneAPI Math Kernel Library (oneMKL) helps you achieve maximum performance with a math computing library of highly optimized, extensively parallelized routines for CPU and GPU. The library has C and Fortran interfaces for most routines on CPU, and SYCL interfaces for some routines on both CPU and GPU. You can find comprehensive support for several math operations in various interfaces including:

SYCL on CPU and GPU 
(Refer to the Intel® oneAPI Math Kernel Library—Data Parallel C++ Developer Reference for more details.)
- Linear algebra
  - BLAS
  - Selected Sparse BLAS functionality
  - Selected LAPACK functionality
- Fast Fourier Transforms (FFT)
  - 1D r2c FFT
  - 1D c2c FFT
- Random number generators
  - Single precision Uniform, Gaussian, and Lognormal distributions
- Selected Vector Math functionality

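The example below uses the GEMM function from the oneMKL BLAS routines, which computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product, with general matrices: `C = alpha * op(A) * op(B) + beta * C`, where `op(X)` is `X` or its transpose depending on `transa` and `transb`. The function signature is: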

```cpp
void gemm(queue &exec_queue, transpose transa, transpose transb, std::int64_t m, std::int64_t n, std::int64_t k, T alpha, buffer<T, 1> &a, std::int64_t lda, buffer<T, 1> &b, std::int64_t ldb, T beta, buffer<T, 1> &c, std::int64_t ldc)
```

This single oneMKL function call performs all of the necessary optimizations for CPU or GPU offload. As you go through the exercises, you will see that it delivers the best overall results with the least amount of code across all of the platforms.
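To make the GEMM definition concrete, here is a minimal host-only sketch in plain C++, not the oneMKL implementation. It assumes no transposition and assumes the column-major storage convention of classic BLAS. It also illustrates a common trick for row-major data: passing the second matrix first, which appears to be why the oneMKL call in the code cell below passes `b` before `a`. The names `gemm_col_major` and `multiply_row_major` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM on column-major data, no transposition (illustrative only):
//   C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n).
// Element (row, col) of a column-major matrix with leading dimension ld
// lives at index row + col * ld.
void gemm_col_major(std::size_t m, std::size_t n, std::size_t k, float alpha,
                    const std::vector<float>& a, std::size_t lda,
                    const std::vector<float>& b, std::size_t ldb, float beta,
                    std::vector<float>& c, std::size_t ldc) {
    assert(a.size() >= lda * k && b.size() >= ldb * n && c.size() >= ldc * n);
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a[row + p * lda] * b[p + col * ldb];
            }
            c[row + col * ldc] = alpha * sum + beta * c[row + col * ldc];
        }
    }
}

// Row-major N x N product C = A * B computed with the column-major routine.
// A row-major buffer read as column-major is the transpose, so passing
// (b, a) computes B^T * A^T = (A * B)^T, which is A * B when read row-major.
std::vector<float> multiply_row_major(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t n) {
    std::vector<float> c(n * n, 0.0f);  // zeroed because beta = 1 accumulates into C
    gemm_col_major(n, n, n, 1.0f, b, n, a, n, 1.0f, c, n);
    return c;
}
```

For example, calling `multiply_row_major` on the row-major matrices `{1, 2, 3, 4}` and `{5, 6, 7, 8}` with `n = 2` yields the row-major product `{19, 22, 43, 50}`.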

## Matrix Multiplication with Math Kernel Library (oneMKL)
The following SYCL code uses the oneMKL GEMM routine. Inspect the code; no modifications are necessary:

1. __Run__ ▶ the cell in the __Select Offload Device__ section to choose a target device. Because Jupyter executes cells linearly, to run on a different device later you must choose a new target there and then re-run the cells that follow to get updated results.
2. Inspect the following code cell and click __Run__ ▶ to save the code to a file.
3. Next, run the cell in the __Build and Run__ section below the code to compile and execute the code.

#### Select Offload Device

In [None]:
run accelerator.py

In [None]:
%%writefile lab/mm_dpcpp_mkl.cpp
//==============================================================
// Matrix Multiplication: SYCL oneMKL
//==============================================================
// Copyright © 2021 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================


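#include <sycl/sycl.hpp>        //# SYCL 2020 header; replaces the deprecated <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include "oneapi/mkl/blas.hpp"  //# oneMKL DPC++ interface for BLAS functions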

using namespace sycl;

void mm_kernel(queue &q, std::vector<float> &matrix_a, std::vector<float> &matrix_b, std::vector<float> &matrix_c, size_t N, size_t M) {
    std::cout << "Configuration         : MATRIX_SIZE= " << N << "x" << N << "\n";
    
    //# Create buffers for matrices
    buffer a(matrix_a);
    buffer b(matrix_b);
    buffer c(matrix_c);

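    //# scalar multipliers for oneMKL: C = alpha * op(A) * op(B) + beta * C
    //# beta = 1 adds the product to the existing contents of matrix_c,
    //# so matrix_c must hold zeros on entry for the result to equal A * B
    float alpha = 1.f, beta = 1.f;

    //# transpose status of matrices for oneMKL
    oneapi::mkl::transpose transA = oneapi::mkl::transpose::nontrans;
    oneapi::mkl::transpose transB = oneapi::mkl::transpose::nontrans;

    //# Submit MKL library call to execute on device
    //# Assuming row-major matrices and a column-major gemm, b is passed before a:
    //# in column-major terms this computes C^T = B^T * A^T, which is C = A * B in row-major
    oneapi::mkl::blas::gemm(q, transA, transB, N, N, N, alpha, b, N, a, N, beta, c, N);
    //# Host access blocks until the gemm call has finished writing to c
    host_accessor hc(c, read_only);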
}

#### Build and Run
Select the cell below and click __Run__ ▶ to compile and execute the code on the selected device:

In [None]:
! chmod 755 q; chmod 755 run_mm_mkl.sh; if [ -x "$(command -v qsub)" ]; then ./q run_mm_mkl.sh "{device.value}"; else ./run_mm_mkl.sh; fi

### Roofline Report

Execute the following line to display the roofline results.


In [None]:
run display_data/mm_mkl_roofline.py

### VTune™ Profiler Summary

Execute the following line to display the VTune results.

In [None]:
run display_data/mm_mkl_vtune.py

## Basic Parallel Kernel Implementation

In this section, we look at how matrix multiplication can be implemented using a SYCL basic parallel kernel. This is the simplest SYCL implementation, without any optimizations. In the next few modules, we will add optimizations on top of this implementation to improve performance.

<img src="Assets/naive.PNG">

We can define the kernel using `parallel_for` with a 2-dimensional range over the matrix, so that each work-item computes one element of the result matrix, as shown below:


```cpp
        h.parallel_for(range<2>{N,N}, [=](item<2> item){
            const int i = item.get_id(0);
            const int j = item.get_id(1);
            for (int k = 0; k < N; k++) {
                C[i*N+j] += A[i*N+k] * B[k*N+j];
            }
        });
```
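As a way to reason about what each work-item does, the kernel above can be emulated on the host in plain C++. In this minimal sketch (an illustration, not part of the lab code), the two outer loops stand in for `range<2>{N,N}` and the lambda body stands in for one work-item; the function name `matmul_basic_host` is hypothetical. Row-major storage and a zero-initialized result matrix are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only emulation of the basic parallel kernel (illustrative only).
// Each call to work_item(i, j) computes one element C[i][j], just as one
// SYCL work-item would; on a device these calls would run in parallel.
std::vector<float> matmul_basic_host(const std::vector<float>& A,
                                     const std::vector<float>& B,
                                     std::size_t N) {
    assert(A.size() == N * N && B.size() == N * N);
    std::vector<float> C(N * N, 0.0f);  // "+=" below relies on C starting at zero

    auto work_item = [&](std::size_t i, std::size_t j) {
        for (std::size_t k = 0; k < N; ++k) {
            C[i * N + j] += A[i * N + k] * B[k * N + j];
        }
    };

    // Stand-in for parallel_for over range<2>{N, N}.
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            work_item(i, j);
        }
    }
    return C;
}
```

A serial version like this can also serve as a reference for checking the device results on small matrices, since no work-item depends on the output of another.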


## Matrix Multiplication with basic parallel kernel

The following SYCL code shows the basic parallel kernel implementation of matrix multiplication. Inspect code; there are no modifications necessary:

1. Run the cell in the __Select Offload Device__ section to choose a target device to run the code on.
2. Inspect the following code cell and click __Run__ ▶ to save the code to a file.
3. Next, run the cell in the __Build and Run__ section to compile and execute the code.

#### Select Offload Device

In [None]:
run accelerator.py

In [None]:
%%writefile lab/mm_dpcpp_basic.cpp
//==============================================================
// Matrix Multiplication: SYCL Basic Parallel Kernel
//==============================================================
// Copyright © 2021 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================


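#include <sycl/sycl.hpp>  //# SYCL 2020 header; replaces the deprecated <CL/sycl.hpp>
#include <iostream>
#include <vector>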

using namespace sycl;

void mm_kernel(queue &q, std::vector<float> &matrix_a, std::vector<float> &matrix_b, std::vector<float> &matrix_c, size_t N, size_t M) {
    std::cout << "Configuration         : MATRIX_SIZE= " << N << "x" << N << "\n";
    
    //# Create buffers for matrices
    buffer a(matrix_a);
    buffer b(matrix_b);
    buffer c(matrix_c);

    //# Submit command groups to execute on device
    auto e = q.submit([&](handler &h){
        //# Create accessors to copy buffers to the device
        accessor A(a, h, read_only);
        accessor B(b, h, read_only);
        //# C is both read and written by "+=", so it needs read_write access
        accessor C(c, h, read_write);

        //# Parallel Compute Matrix Multiplication
        h.parallel_for(range<2>{N,N}, [=](item<2> item){
            const int i = item.get_id(0);
            const int j = item.get_id(1);
            for (int k = 0; k < N; k++) {
                C[i*N+j] += A[i*N+k] * B[k*N+j];
            }
        });
    });
    host_accessor hc(c, read_only);
    
    //# print kernel compute duration from event profiling
    //# (requires a queue created with property::queue::enable_profiling)
    auto kernel_duration = (e.get_profiling_info<info::event_profiling::command_end>() - e.get_profiling_info<info::event_profiling::command_start>());
    std::cout << "Kernel Execution Time : " << kernel_duration / 1e+9 << " seconds\n";
}



#### Build and Run
Select the cell below and click __Run__ ▶ to compile and execute the code on the selected device:

In [None]:
! chmod 755 q; chmod 755 run_mm_basic.sh;if [ -x "$(command -v qsub)" ]; then ./q run_mm_basic.sh "{device.value}"; else ./run_mm_basic.sh; fi

### Roofline Report

Execute the following line to display the roofline results.


In [None]:
run display_data/mm_basic_roofline.py

### VTune™ Profiler Summary

Execute the following line to display the VTune results.

In [None]:
run display_data/mm_basic_vtune.py

### Analysis

Comparing the execution times of the basic SYCL implementation and the oneMKL implementation for various matrix sizes, we can see that for a small matrix size of 1024x1024, the basic SYCL implementation performs better than the oneMKL implementation. For large matrices, the oneMKL implementation significantly outperforms the basic SYCL implementation. The graph below shows execution times on various hardware for matrix sizes 1024x1024, 5120x5120 and 10240x10240.

<img src="Assets/ppp_basic_mkl_graph.PNG">
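When comparing runs across matrix sizes, it can help to convert execution time into throughput. The sketch below assumes the conventional operation count of 2·N³ floating-point operations for an N x N matrix multiplication (one multiply and one add per inner-loop step). The helper names `gemm_flops` and `gemm_gflops` are hypothetical, and the elapsed time is whatever your own run reports.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations for an N x N by N x N matrix multiplication,
// assuming one multiply and one add per inner-loop iteration.
double gemm_flops(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Achieved throughput in GFLOP/s for a run that took `seconds`.
double gemm_gflops(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    return gemm_flops(n) / seconds / 1e9;
}
```

Going from N = 1024 to N = 10240 multiplies the work by 1000, so execution times alone do not show how efficiently each implementation uses the hardware at different sizes; throughput does.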


### Summary

In this module, we looked at the oneAPI Math Kernel Library (oneMKL) and implemented matrix multiplication using a oneMKL function. We also implemented matrix multiplication using a SYCL basic parallel kernel. We compared performance numbers for the two implementations and saw the benefits of using a library like oneMKL for the computation rather than a basic SYCL implementation.


## Resources

Check out these related resources:

#### Intel® oneAPI Toolkit documentation

* [Intel Advisor Roofline](https://software.intel.com/content/www/us/en/develop/articles/intel-advisor-roofline.html)
* [Intel VTune](https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/introduction.html)
* [Intel® oneAPI main page](https://software.intel.com/oneapi "oneAPI main page")
* [Intel® oneAPI programming guide](https://software.intel.com/sites/default/files/oneAPIProgrammingGuide_3.pdf "oneAPI programming guide")
* [Intel® DevCloud Signup](https://software.intel.com/en-us/devcloud/oneapi "Intel DevCloud")  Sign up here if you do not have an account.
* [Get Started with oneAPI for Linux*](https://software.intel.com/en-us/get-started-with-intel-oneapi-linux)
* [Get Started with oneAPI for Windows*](https://software.intel.com/en-us/get-started-with-intel-oneapi-windows)
* [Intel® oneAPI Code Samples](https://software.intel.com/en-us/articles/code-samples-for-intel-oneapibeta-toolkits)
* [oneAPI Specification elements](https://www.oneapi.com/spec/)

#### SYCL 
* [SYCL* 2020 Specification](https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf)

#### Modern C++
* [CPPReference](https://en.cppreference.com/w/)
* [CPlusPlus](http://www.cplusplus.com/)

***