# Profile Intel® oneAPI Deep Neural Network Library (oneDNN) Samples by using Intel® VTune Profiler and the oneDNN ITT Tagging feature

## Learning Objectives
In this module the developer will:
* Learn how to use VTune Profiler to profile oneDNN samples on CPU & GPU
* Learn how to use the oneDNN ITT Tagging feature to profile oneDNN samples at the primitive level
* Learn how to identify performance bottlenecks with VTune profiling

***
# VTune Profiling Exercise


## Prerequisites
***
### Step 1: Prepare the build/run environment
oneDNN has four different configurations inside the Intel oneAPI toolkits. Each configuration is in a different folder under the oneDNN installation path, and each configuration supports a different compiler or threading library.

Set the installation path of your oneAPI toolkit.


In [None]:
# ignore all warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
%env ONEAPI_INSTALL=/opt/intel/oneapi

In [None]:
import os
if not os.path.isdir(os.environ['ONEAPI_INSTALL']):
    print("ERROR! wrong oneAPI installation path")

In [None]:
!printf '%s\n'     $ONEAPI_INSTALL/dnnl/latest/cpu_*

As you can see, there are four different folders under the oneDNN installation path, and each of those configurations supports different features. This tutorial will show you how to compile and run against different oneDNN configurations.
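
The same listing can also be produced from Python, which is handy when a later cell needs the configuration names as a list. This is a minimal sketch: the helper name `list_dnnl_configs` is hypothetical, and it assumes the `dnnl/latest/cpu_*` folder layout used by the cell above.

```python
import glob
import os


def list_dnnl_configs(oneapi_root):
    """Return the names of the cpu_* configuration folders under dnnl/latest."""
    pattern = os.path.join(oneapi_root, "dnnl", "latest", "cpu_*")
    # Keep directories only and return bare folder names in a stable order.
    return sorted(
        os.path.basename(path) for path in glob.glob(pattern) if os.path.isdir(path)
    )


if __name__ == "__main__":
    # Falls back to the default path used in this notebook if the variable is unset.
    root = os.environ.get("ONEAPI_INSTALL", "/opt/intel/oneapi")
    for name in list_dnnl_configs(root):
        print(name)
```

If the installation path is wrong, the function simply returns an empty list.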

First, create a lab folder for this exercise.

In [None]:
!rm -rf lab;mkdir lab

Install required python packages.

In [None]:
!pip3 install -r requirements.txt

Get current platform information for this exercise.

In [None]:
from profiling.profile_utils import PlatformUtils
plat_utils = PlatformUtils()
plat_utils.dump_platform_info()

###  Step 2: Preparing the performance profiling sample

This exercise uses the performance_profiling.cpp example from the oneDNN installation path.

The cell below copies the performance_profiling.cpp file into the lab folder.  
The required header files and CMake file are copied into the lab folder in a later cell.

In [None]:
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/performance_profiling.cpp lab/

You can browse the source code by running the cell below, which also strips comments for readability.

In [None]:
!cpp -fpreprocessed  -dD -E lab/performance_profiling.cpp

Then, copy the required header files and CMake file into the lab folder.

In [None]:
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/example_utils.hpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/example_utils.h lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/CMakeLists.txt lab/

Patch the examples to lengthen the runtime for profiling.

In [None]:
!

#### The performance profiling sample supports different memory formats

|supported memory format | command |Description|
|:-----|:----|:-----|
|naive| performance-profiling-cpp cpu naive |use plain format (ex: NCHW) for the convolution|
|blocked|performance-profiling-cpp cpu blocked|use blocked format (ex: nChw16c) for the convolution|
|fused|performance-profiling-cpp cpu fused|use blocked format for the convolution and fuse ReLU into it via post-ops|
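
To make the difference between the plain and blocked formats concrete, the sketch below computes the linear memory offset of one tensor element in each layout. It is an illustration only: the function names are hypothetical, and the blocked variant assumes the channel count is a multiple of 16, so no padding is needed.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset of element (n, c, h, w) in a plain NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


def offset_nchw16c(n, c, h, w, C, H, W, block=16):
    """Linear offset of the same element in a channel-blocked nChw16c tensor.

    Assumes C is a multiple of `block` (no channel padding).
    """
    blocks_per_image = C // block
    # Position of the 16-channel group that holds this pixel.
    outer = ((n * blocks_per_image + c // block) * H + h) * W + w
    # Channels within one group are stored next to each other.
    return outer * block + c % block


if __name__ == "__main__":
    C, H, W = 32, 2, 2
    # Distance in memory between channel 0 and channel 1 of the same pixel.
    print("NCHW:", offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W))
    print("nChw16c:", offset_nchw16c(0, 1, 0, 0, C, H, W) - offset_nchw16c(0, 0, 0, 0, C, H, W))
```

In the blocked layout, 16 consecutive channels of a pixel sit side by side in memory, which is what lets a vectorized kernel load them with a single contiguous access.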


### Step 3:  Build and Run with GNU Compiler and OpenMP 
One of the oneDNN configurations supports GNU compilers, but it can run only on CPU.
The following section shows you how to build with G++ and run on CPU.

#### Script - build.sh
The script **build.sh** encapsulates the compiler command and flags that will generate the executable.
The user must switch to the G++ oneDNN configuration by passing the custom configuration "--dnnl-configuration=cpu_gomp" when running "source setvars.sh".
In order to use the G++ compiler and the related OMP runtime, some definitions must be passed as cmake arguments.
Here are the related cmake arguments for the GNU compiler and OpenMP configuration:

  -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DDNNL_CPU_RUNTIME=OMP -DDNNL_GPU_RUNTIME=NONE
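
If you build several configurations from one notebook, it can help to assemble the cmake command from its parts instead of retyping it. The sketch below does that for the four definitions listed above; the helper name `cmake_command` is hypothetical, and only the cmake definitions already shown in this step are used.

```python
import shlex


def cmake_command(source_dir, c_compiler, cxx_compiler, cpu_runtime, gpu_runtime):
    """Build the cmake argument list for one oneDNN example configuration."""
    return [
        "cmake",
        source_dir,
        f"-DCMAKE_C_COMPILER={c_compiler}",
        f"-DCMAKE_CXX_COMPILER={cxx_compiler}",
        f"-DDNNL_CPU_RUNTIME={cpu_runtime}",
        f"-DDNNL_GPU_RUNTIME={gpu_runtime}",
    ]


# GNU compiler with the OpenMP CPU runtime and no GPU runtime, as in build.sh.
gomp_cmd = cmake_command("..", "gcc", "g++", "OMP", "NONE")
print(shlex.join(gomp_cmd))
```

The returned list can be passed directly to `subprocess.run`, or printed with `shlex.join` to paste into a shell script.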

In [None]:
%%writefile build.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
export EXAMPLE_ROOT=./lab/
mkdir cpu_gomp
cd cpu_gomp
cmake .. -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DDNNL_CPU_RUNTIME=OMP -DDNNL_GPU_RUNTIME=NONE
make performance-profiling-cpp



Once the build completes without errors, you can execute the program on the DevCloud or in a local environment.

#### Script - run.sh
The script **run.sh** encapsulates the program for submission to the job queue for execution.
The user must switch to the G++ oneDNN configuration by inputting a custom configuration "--dnnl-configuration=cpu_gomp" when running "source setvars.sh".

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the run"
./cpu_gomp/out/performance-profiling-cpp
echo "########## Done with the run"



#### OPTIONAL: replace $ONEAPI_INSTALL with its actual value in both build.sh and run.sh
> NOTE: this step is mandatory if you run the notebook on DevCloud
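
The `FileUtils` helper used in the next cell ships with this notebook and its implementation is not shown here. As a rough idea of what such a substitution involves, the sketch below performs a plain literal replacement in a text file. The function name `replace_in_file` is hypothetical, and this is not necessarily how `FileUtils.replace_string_in_file` works internally.

```python
from pathlib import Path


def replace_in_file(path, old, new):
    """Replace every literal occurrence of `old` with `new` in a text file."""
    file_path = Path(path)
    text = file_path.read_text()
    # str.replace treats `old` literally, so "$" needs no escaping.
    file_path.write_text(text.replace(old, new))
```

Because the match is literal, `$ONEAPI_INSTALL` is replaced as written and is never expanded by a shell.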

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('build.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('run.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )



#### Submitting **build.sh** and **run.sh** to the job queue
Now we can submit the **build.sh** and **run.sh** to the job queue.

##### NOTE - the build and run commands can also be executed in local environments.
To let users run their scripts either on the DevCloud or in local environments, this and subsequent trainings check for the existence of the job submission command **qsub**. If the check fails, the build/run is assumed to be local.
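
The same decision can be expressed in Python. The sketch below mirrors the shell check used in the cells of this notebook: it looks for an executable `qsub` on the `PATH` and returns either the `./q` wrapper command or a direct invocation. The function name `launch_command` and its `which` parameter are hypothetical, the latter added only so the lookup can be replaced in tests.

```python
import shutil


def launch_command(script, which=shutil.which):
    """Return the command for `script`: via ./q if qsub exists, else run it directly."""
    if which("qsub") is not None:
        # Job queue available: submit through the q wrapper script.
        return ["./q", script]
    # No job queue: run the script locally.
    return [f"./{script}"]


if __name__ == "__main__":
    print(launch_command("build.sh"))
```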

In [None]:
! rm -rf cpu_gomp;chmod 755 q; chmod 755 build.sh; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q build.sh; ./q run.sh; else ./build.sh; ./run.sh; fi

## Profiling oneDNN Performance by VTune
***
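In this section, we profile the performance profiling sample with VTune Profiler and group the results by oneDNN primitive.  
Refer to the [link](https://oneapi-src.github.io/oneDNN/dev_guide_profilers.html) for related VTune profiling information.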

###  Different VTune Profiling Types

|Profiling Type|collect argument|Description|
|:-----|:----|:-----|
|Hotspots|hotspots|top oneDNN primitive hotspots by CPU time|
|Microarchitecture Exploration|uarch-exploration|vectorization and memory-bound metrics per oneDNN primitive|
|Threading|threading|thread oversubscription|


###  Profile the performance profiling sample with naive data format


####  1. Top oneDNN primitive hotspots

* vtune -collect hotspots -q -no-summary -knob sampling-mode=hw -r dnnl-vtune ./cpu_gomp/out/performance-profiling-cpp cpu naive
* vtune -report hotspots -q -r dnnl-vtune -format csv -csv-delimiter ';' -group-by task -column 'CPU Time:Self' | head -n 10 | column -t -s';'
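
If you prefer to post-process the report in Python rather than with `head` and `column`, the semicolon-delimited text can be parsed with the standard `csv` module. The sketch below is illustrative only: the helper names are hypothetical, the column headers `Task Type` and `CPU Time:Self` are assumptions about the report header, and the sample rows are made up rather than taken from a real VTune run.

```python
import csv
import io


def parse_report(csv_text, delimiter=";"):
    """Parse delimiter-separated report text into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


def top_tasks(rows, name_column, time_column, limit=5):
    """Return (task, seconds) pairs sorted by descending time."""
    pairs = [(row[name_column], float(row[time_column])) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical sample, not real VTune output.
sample = "Task Type;CPU Time:Self\nrelu;0.25\nconvolution;1.50\nreorder;0.10\n"
print(top_tasks(parse_report(sample), "Task Type", "CPU Time:Self", limit=2))
```

The parser expects the raw CSV text, that is, the report output before it is piped through `column -t`.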

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect hotspots -q -no-summary -knob sampling-mode=hw -r dnnl-vtune ./cpu_gomp/out/performance-profiling-cpp cpu naive
echo "########## Done with the profiling"


In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune -format csv -csv-delimiter ';' -group-by task -column 'CPU Time:Self' | head -n 10 | column -t -s';'
echo "########## Done with the analyzing"


In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

####  2. Vectorization over oneDNN primitives

* vtune -collect uarch-exploration -knob sampling-interval=1 -data-limit=2000 -q -no-summary -r dnnl-vtune-ue ./cpu_gomp/out/performance-profiling-cpp cpu naive
* vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Vector' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Scalar' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'Vector Capacity Usage' | head -n 10 | column -t -s';'
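
The report commands in this notebook differ only in the result directory and the column name, so they can be generated from a small template. The sketch below rebuilds the pipeline string from those two inputs; the helper name `report_command` is hypothetical, and it uses only the `vtune` options that already appear in this notebook.

```python
import shlex


def report_command(result_dir, column, rows=10):
    """Return the shell pipeline that prints one per-task metric column."""
    report = [
        "vtune", "-report", "hotspots", "-q", "-r", result_dir,
        "-format", "csv", "-csv-delimiter", ";",
        "-group-by", "task", "-column", column,
    ]
    # shlex.join quotes the delimiter and any column name containing spaces.
    return f"{shlex.join(report)} | head -n {rows} | column -t -s';'"


for metric in ("FP Arithmetic:FP Vector", "FP Arithmetic:FP Scalar", "Vector Capacity Usage"):
    print(report_command("dnnl-vtune-ue", metric))
```

Each printed line can be pasted into analyze.sh as-is.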

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect uarch-exploration -knob sampling-interval=1 -data-limit=2000 -q -no-summary -r dnnl-vtune-ue ./cpu_gomp/out/performance-profiling-cpp cpu naive
echo "########## Done with the profiling"


In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Vector' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Scalar' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'Vector Capacity Usage' | head -n 10 | column -t -s';'
echo "########## Done with the analyzing"


In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

####  3. Memory Bound over oneDNN primitives

This step reuses the dnnl-vtune-ue result collected by the uarch-exploration run in the previous step, so no new collection is needed.

* vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'DRAM Bound' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'L3 Bound' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'L1 Bound' | head -n 10 | column -t -s';'
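
One simple way to read the three columns together is to pick, for each primitive, the memory level with the largest value. The sketch below does this for a single task. The function name `dominant_bound` is hypothetical and the numbers are invented for illustration, not measured values.

```python
def dominant_bound(metrics):
    """Return the (level, value) pair with the largest value."""
    level = max(metrics, key=metrics.get)
    return level, metrics[level]


# Hypothetical per-task percentages, not measured values.
sample = {"DRAM Bound": 12.0, "L3 Bound": 3.5, "L1 Bound": 7.25}
print(dominant_bound(sample))
```

This is only a first-pass heuristic; a small difference between levels should not be over-interpreted.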

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'DRAM Bound' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'L3 Bound' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ';' -group-by task -column 'L1 Bound' | head -n 10 | column -t -s';'
echo "########## Done with the analyzing"


In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

####  4. Thread Oversubscription

* vtune -collect threading -data-limit=2000 -q -no-summary -r dnnl-vtune-th ./cpu_gomp/out/performance-profiling-cpp cpu naive
* vtune -report summary -r dnnl-vtune-th --format html -report-output summary.html

Read the thread oversubscription value from the generated summary.
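
Before collecting, you can check whether a given thread count is likely to oversubscribe the machine. The sketch below compares the requested number of threads with the number of logical CPUs reported by Python. The helper name `is_oversubscribed` is hypothetical, and the check is a rough heuristic that ignores CPU affinity masks and other processes running on the system.

```python
import os


def is_oversubscribed(requested_threads, logical_cpus=None):
    """True when more threads are requested than logical CPUs are available."""
    if logical_cpus is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        logical_cpus = os.cpu_count() or 1
    return requested_threads > logical_cpus


if __name__ == "__main__":
    # 24 matches the OMP_NUM_THREADS value used in profile.sh for this step.
    print(is_oversubscribed(24))
```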


In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
export OMP_NUM_THREADS=24 
vtune -collect threading -data-limit=2000 -q -no-summary -r dnnl-vtune-th ./cpu_gomp/out/performance-profiling-cpp cpu naive
echo "########## Done with the profiling"


In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report summary -r dnnl-vtune-th --format html -report-output summary.html
echo "########## Done with the analyzing"


In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


In [None]:
!rm -rf dnnl-vtune-th; chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

In [None]:
from IPython.display import IFrame
IFrame(src='summary.html', width=960, height=600)

###  Profile the performance profiling sample with blocked data format


####  1. Top oneDNN primitive hotspots

* vtune -collect hotspots -q -no-summary -knob sampling-mode=hw -r dnnl-vtune-b ./cpu_gomp/out/performance-profiling-cpp cpu blocked
* vtune -report hotspots -q -r dnnl-vtune-b -format csv -csv-delimiter ';' -group-by task -column 'CPU Time:Self' | head -n 10 | column -t -s';'
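
Once both the naive and the blocked hotspots reports are available, the per-primitive CPU times can be compared directly. The sketch below computes the naive-to-blocked ratio for each task that appears in both runs. The function name `speedups` is hypothetical and the timings are invented for illustration, not measured results.

```python
def speedups(naive_seconds, blocked_seconds):
    """Per-task ratio naive/blocked for tasks present in both runs."""
    common = naive_seconds.keys() & blocked_seconds.keys()
    return {
        task: naive_seconds[task] / blocked_seconds[task]
        for task in sorted(common)
        if blocked_seconds[task] > 0  # skip tasks with no measured time
    }


# Hypothetical CPU times in seconds, not measured values.
naive = {"convolution": 4.0, "relu": 0.5}
blocked = {"convolution": 1.0, "relu": 0.5, "reorder": 0.2}
print(speedups(naive, blocked))
```

Tasks that exist in only one run, such as a reorder introduced by a format change, are left out of the ratio and are worth inspecting separately.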

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect hotspots -q -no-summary -knob sampling-mode=hw -r dnnl-vtune-b ./cpu_gomp/out/performance-profiling-cpp cpu blocked
echo "########## Done with the profiling"


In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-b -format csv -csv-delimiter ';' -group-by task -column 'CPU Time:Self' | head -n 10 | column -t -s';'
echo "########## Done with the analyzing"


In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

####  2. Vectorization over oneDNN primitives

* vtune -collect uarch-exploration -knob sampling-interval=1 -data-limit=2000 -q -no-summary -r dnnl-vtune-ue-b ./cpu_gomp/out/performance-profiling-cpp cpu blocked
* vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Vector' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Scalar' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'Vector Capacity Usage' | head -n 10 | column -t -s';'

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect uarch-exploration -knob sampling-interval=1 -data-limit=2000 -q -no-summary -r dnnl-vtune-ue-b ./cpu_gomp/out/performance-profiling-cpp cpu blocked
echo "########## Done with the profiling"


In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Vector' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'FP Arithmetic:FP Scalar' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'Vector Capacity Usage' | head -n 10 | column -t -s';'
echo "########## Done with the analyzing"


In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

####  3. Memory Bound over oneDNN primitives

This step reuses the dnnl-vtune-ue-b result collected by the uarch-exploration run in the previous step, so no new collection is needed.

* vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'DRAM Bound' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'L3 Bound' | head -n 10 | column -t -s';'
* vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'L1 Bound' | head -n 10 | column -t -s';'

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'DRAM Bound' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'L3 Bound' | head -n 10 | column -t -s';'
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ';' -group-by task -column 'L1 Bound' | head -n 10 | column -t -s';'
echo "########## Done with the analyzing"


In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

***
# Summary
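In this lab the developer learned the following:
* What are the different oneDNN configurations inside the Intel oneAPI toolkits
* How to compile and run a oneDNN sample with the GNU compiler and OpenMP configuration via batch jobs on the Intel oneAPI DevCloud or in local environments
* How to use VTune Profiler to collect hotspots, microarchitecture, and threading data for a oneDNN sample
* How to group VTune reports by oneDNN primitive to identify performance bottlenecks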
