# Profile Intel® oneAPI Deep Neural Network Library (oneDNN) Samples by using Intel® VTune™ Profiler and oneDNN ITT Tagging feature

## Learning Objectives
In this module the developer will:
* Learn how to use VTune™ Profiler to profile oneDNN samples on CPU & GPU
* Learn how to use the oneDNN ITT Tagging feature to profile oneDNN samples at the primitive level
* Learn how to identify performance bottlenecks with VTune™ profiling

***
# VTune Profiling Exercise


## Prerequisites
***
### Step 1: Prepare the build/run environment
oneDNN ships in four different configurations inside the Intel oneAPI toolkits. Each configuration lives in its own folder under the oneDNN installation path and supports a different compiler or threading library.

Set the installation path of your oneAPI toolkit.


In [None]:
# ignore all warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
%env ONEAPI_INSTALL=/opt/intel/oneapi

In [None]:
import os
if not os.path.isdir(os.environ['ONEAPI_INSTALL']):
    print("ERROR! wrong oneAPI installation path")

In [None]:
!printf '%s\n'     $ONEAPI_INSTALL/dnnl/latest/cpu_*

As you can see, there are four different folders under the oneDNN installation path, and each of those configurations supports different features. This tutorial will show you how to compile and run against different oneDNN configurations.
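The same listing can be done from Python. The following is a minimal sketch, assuming only that the configuration folders are named `cpu_*` as the shell command above suggests. The helper name `list_dnnl_configurations` is hypothetical, and the demo builds a temporary directory tree instead of reading a real installation.

```python
import glob
import os
import tempfile


def list_dnnl_configurations(dnnl_root):
    """Return the sorted names of the cpu_* folders found under dnnl_root."""
    pattern = os.path.join(dnnl_root, "cpu_*")
    return sorted(
        os.path.basename(path) for path in glob.glob(pattern) if os.path.isdir(path)
    )


# Demo on a throwaway directory so the sketch runs without oneAPI installed.
with tempfile.TemporaryDirectory() as root:
    for name in ("cpu_gomp", "cpu_dpcpp_gpu_dpcpp"):
        os.makedirs(os.path.join(root, name))
    demo_configs = list_dnnl_configurations(root)

print(demo_configs)
```

On a real system, the argument would be the `dnnl/latest` folder under `ONEAPI_INSTALL`.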

First, create a lab folder for this exercise.

In [None]:
!rm -rf lab;mkdir lab

Install required python packages.

In [None]:
!pip3 install -r requirements.txt

Get current platform information for this exercise.

In [None]:
from profiling.profile_utils import PlatformUtils
plat_utils = PlatformUtils()
plat_utils.dump_platform_info()

###  Step 2: Preparing the performance profiling sample

This exercise uses the performance_profiling.cpp example from the oneDNN installation path.
> NOTE: please refer to the [oneDNN doc](https://oneapi-src.github.io/oneDNN/performance_profiling_cpp.html) for the implementation details of performance_profiling.cpp.  

The cell below copies the performance_profiling.cpp file into the lab folder.  
The required header files and CMake file are copied in a later cell.

In [None]:
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/performance_profiling.cpp lab/

Run the cell below to browse the source code. The command also strips comments for readability.

In [None]:
!cpp -fpreprocessed  -dD -E lab/performance_profiling.cpp

Then, copy the required header files and CMake file into the lab folder.

In [None]:
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/example_utils.hpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/example_utils.h lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/CMakeLists.txt lab/

Patch the example to lengthen its runtime for profiling.

In [None]:
!cd lab;patch < ../codes_for_ipynb/add_loop.patch;cd ..

#### The performance profiling sample supports different memory formats

|Supported memory format|Command|Description|
|:-----|:----|:-----|
|naive| performance-profiling-cpp cpu naive |use plain format (ex: NCHW) for the convolution|
|blocked|performance-profiling-cpp cpu blocked|use blocked format (ex: nChw16c) for the convolution|
|fused|performance-profiling-cpp cpu fused|use blocked format for the convolution and fuse ReLU into it as a post-op|
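
To make the difference between the plain and blocked formats concrete, the sketch below computes the linear offset of element (n, c, h, w) in each layout. It assumes a dense tensor without padding and a block size of 16 channels. The function names are illustrative and are not part of the oneDNN API.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset of element (n, c, h, w) in a dense NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


def offset_nChw16c(n, c, h, w, C, H, W, block=16):
    """Linear offset in a dense nChw16c tensor (channels grouped in blocks)."""
    c_blocks = (C + block - 1) // block  # number of channel blocks
    outer = ((n * c_blocks + c // block) * H + h) * W + w
    return outer * block + c % block


C, H, W = 32, 2, 2

# Distance in memory between channel 0 and channel 1 of the same pixel.
stride_plain = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
stride_blocked = offset_nChw16c(0, 1, 0, 0, C, H, W) - offset_nChw16c(0, 0, 0, 0, C, H, W)

print("NCHW channel stride:   ", stride_plain)
print("nChw16c channel stride:", stride_blocked)
```

In the blocked layout, 16 consecutive channels of one pixel are adjacent in memory, whereas in NCHW they are H*W elements apart. This adjacency is what makes the blocked format friendlier to vector loads.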


### Step 3:  Build and Run with GNU Compiler and OpenMP 
One of the oneDNN configurations supports GNU compilers, but it can run only on CPU.
The following section shows you how to build with G++ and run on CPU.

#### Script - build.sh
The script **build.sh** encapsulates the compiler command and flags that will generate the executable.
The user must switch to the G++ oneDNN configuration by passing the custom configuration "--dnnl-configuration=cpu_gomp" when running "source setvars.sh".
In order to use the G++ compiler and the related OpenMP runtime, some definitions must be passed as cmake arguments.
Here are the related cmake arguments for the GNU compiler and OpenMP configuration: 

  -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DDNNL_CPU_RUNTIME=OMP -DDNNL_GPU_RUNTIME=NONE

In [None]:
%%writefile build.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
export EXAMPLE_ROOT=./lab/
mkdir cpu_gomp
cd cpu_gomp
cmake .. -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DDNNL_CPU_RUNTIME=OMP -DDNNL_GPU_RUNTIME=NONE
make performance-profiling-cpp



Once the build completes without errors, you can execute the program on the DevCloud or in a local environment.

#### Script - run.sh
The script **run.sh** encapsulates the program for submission to the job queue for execution.
The user must switch to the G++ oneDNN configuration by passing the custom configuration "--dnnl-configuration=cpu_gomp" when running "source setvars.sh".

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the run"
./cpu_gomp/out/performance-profiling-cpp
echo "########## Done with the run"



#### OPTIONAL: replace $ONEAPI_INSTALL with its actual value in both build.sh and run.sh
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('build.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('run.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
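
For readers without access to the `profiling` helper package, the substitution can be approximated with plain Python. This is a minimal sketch, assuming the helper performs a literal text replacement. It is a hypothetical stand-in, not the actual `FileUtils` implementation, and the demo path is a temporary file.

```python
import os
import tempfile


def replace_in_file(path, old, new):
    """Replace every literal occurrence of old with new inside the file at path."""
    with open(path, "r", encoding="utf-8") as handle:
        text = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text.replace(old, new))


# Demo on a temporary script so no real file is modified.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "demo.sh")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write("source $ONEAPI_INSTALL/setvars.sh\n")
    replace_in_file(script, "$ONEAPI_INSTALL", "/opt/intel/oneapi")
    with open(script, "r", encoding="utf-8") as handle:
        patched = handle.read()

print(patched)
```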



#### Submitting **build.sh** and **run.sh** to the job queue
Now we can submit the **build.sh** and **run.sh** to the job queue.

##### NOTE - it is possible to execute any of the build and run commands in local environments.
To let users run their scripts either on the DevCloud or in a local environment, this and subsequent cells check for the existence of the job submission command **qsub**. If the check fails, the build/run is assumed to be local.

In [None]:
! rm -rf cpu_gomp;chmod 755 q; chmod 755 build.sh; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q build.sh; ./q run.sh; else ./build.sh; ./run.sh; fi
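
The `qsub` check in the shell one-liner above can be expressed in Python as well. The sketch below is illustrative only: `choose_runner` is a hypothetical helper, and it only builds the command list instead of launching anything.

```python
import shutil


def choose_runner(script, scheduler="qsub", wrapper="./q"):
    """Return the command used to launch script.

    If the scheduler command is found on PATH, the script goes through the
    wrapper (job queue). Otherwise it is run directly (local environment).
    """
    if shutil.which(scheduler) is not None:
        return [wrapper, script]
    return ["./" + script]


# A scheduler name that does not exist forces the local branch.
local_cmd = choose_runner("build.sh", scheduler="no-such-scheduler-command")
print(local_cmd)
```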

## Profiling oneDNN Performance by VTune™
***
In this section, we profile the performance profiling sample with both the naive data format and the blocked data format by using VTune™ and the [ITT tagging feature](https://github.com/oneapi-src/oneDNN/tree/rfcs/rfcs/20201014-VTune-ITT-tagging) from oneDNN.  
Users should observe different vectorization and memory bound ratios for each primitive between the two data formats.
These measurements show how the data format affects performance.

We use three different VTune™ profiling types in this tutorial.
Users can refer to [VTune™ performance analysis](https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance.html) for more profiling types.  
Users can also refer to the [dev_guide_profilers](https://oneapi-src.github.io/oneDNN/dev_guide_profilers.html) for oneDNN related VTune™ profiling information.

###  Different VTune™ Profiling Types

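|Profiling Type|collect argument|Description|
|:-----|:----|:-----|
|Hotspots|hotspots|identify which oneDNN primitives take the most CPU time|
|Microarchitecture|uarch-exploration|measure vectorization (FP Arithmetic) and Memory Bound metrics for each primitive|
|Threading|threading|identify threading problems such as thread oversubscription|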
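
The three profiling types differ mainly in the value passed to `-collect`. As a rough sketch, the mapping can be captured in a small helper that assembles a command line from the options used by the scripts in this notebook (`-collect`, `-q`, `-no-summary`, `-r`). The helper name and dictionary are hypothetical, and nothing is executed.

```python
# Profiling type -> value passed to "vtune -collect" in this notebook.
COLLECT_ARGUMENT = {
    "Hotspots": "hotspots",
    "Microarchitecture": "uarch-exploration",
    "Threading": "threading",
}


def vtune_collect_command(profiling_type, result_dir, app_args):
    """Build (but do not run) a vtune collection command as a list of strings."""
    return [
        "vtune",
        "-collect", COLLECT_ARGUMENT[profiling_type],
        "-q", "-no-summary",
        "-r", result_dir,
        *app_args,
    ]


cmd = vtune_collect_command(
    "Threading",
    "dnnl-vtune-th",
    ["./cpu_gomp/out/performance-profiling-cpp", "cpu", "naive"],
)
print(" ".join(cmd))
```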



###  Profile the performance profiling sample with naive data format
The naive implementation executes 2D convolution followed by ReLU on the data in NCHW format. This implementation does not align with oneDNN best practices and results in suboptimal performance.   
In this section, we identify the performance bottlenecks caused by the naive data format. 

> NOTE: Please refer to [understanding memory format](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html) for more details on the different data formats. 

#### 1. Hotspots Profiling Type

#####  Top oneDNN primitive hotspots
First, we want to know which primitive takes most of the time.  
We will profile the sample by using profile.sh and then analyze the hotspot by using analyze.sh. 

#### Script - profile.sh
The script **profile.sh** encapsulates the program for submission to the profiling job queue for execution.

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect hotspots -q -no-summary -knob sampling-mode=hw -r dnnl-vtune ./cpu_gomp/out/performance-profiling-cpp cpu naive
echo "########## Done with the profiling"


#### Script - analyze.sh
The script **analyze.sh** encapsulates the program for submission to the analyzing job queue for execution.  
In the section below, we extract only the column "CPU Time:Self" from the VTune™ result.

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune -format csv -csv-delimiter ',' -group-by task -column 'CPU Time:Self' | head -n 10 > hotspot.csv
echo "########## Done with the analyzing"


OPTIONAL: replace $ONEAPI_INSTALL with its actual value in both profile.sh and analyze.sh.
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


#### Submitting profile.sh and analyze.sh to the job queue
Now we can submit the profile.sh and analyze.sh to the job queue.

> NOTE - it is possible to execute any of the profile and analyze commands in local environments.
To let users run their scripts either on the DevCloud or in a local environment, this and subsequent cells check for the existence of the job submission command qsub. If the check fails, the profile/analyze step is assumed to be local.

In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi; 

#### Plot a pie chart to illustrate the CPU time percentage among different primitives
The cell also prints the absolute CPU time of each primitive.

In [None]:
import pandas as pd
data = pd.read_csv('hotspot.csv', engine='python')
if not data.empty:
    print(data)
    if len(data.columns) >= 2:            
        data.plot.pie(y=data.columns[1], labels=data.iloc[:,0], figsize=(8, 8), fontsize=20, autopct='%1.1f%%')
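
The percentages shown in the pie chart are simply each primitive's CPU time divided by the total. The sketch below reproduces that arithmetic with the standard library only. The CSV header and the numbers are hypothetical placeholders, not measurements from this sample, and the real column names in hotspot.csv may differ.

```python
import csv
import io

# Hypothetical placeholder data in a two-column "name,time" layout.
SAMPLE_CSV = "Task Type,CPU Time:Self\ntask_a,6.0\ntask_b,3.0\ntask_c,1.0\n"


def cpu_time_shares(csv_text):
    """Map each task name to its percentage of the total CPU time."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Skip the header row; keep only rows with at least two fields.
    times = {row[0]: float(row[1]) for row in rows[1:] if len(row) >= 2}
    total = sum(times.values())
    if total == 0:
        return {name: 0.0 for name in times}
    return {name: 100.0 * value / total for name, value in times.items()}


shares = cpu_time_shares(SAMPLE_CSV)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```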

#### 2. Microarchitecture Profiling Type

#####  2.1 Vectorization over oneDNN primitives
Second, we want to know how well each primitive is vectorized, and microarchitecture profiling type provides those data. 
We will profile the sample by using profile.sh and then analyze how well it is vectorized by using analyze.sh. 

#### Script - profile.sh
The script **profile.sh** encapsulates the program for submission to the profiling job queue for execution.

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect uarch-exploration -knob sampling-interval=1 -data-limit=2000 -q -no-summary -r dnnl-vtune-ue ./cpu_gomp/out/performance-profiling-cpp cpu naive
echo "########## Done with the profiling"


#### Script - analyze.sh
The script **analyze.sh** encapsulates the program for submission to the analyzing job queue for execution.  
In the section below, we extract only the column "FP Arithmetic" from the VTune™ result.

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ',' -group-by task -column 'FP Arithmetic' | head -n 10 > fp.csv
echo "########## Done with the analyzing"


OPTIONAL: replace $ONEAPI_INSTALL with its actual value in both profile.sh and analyze.sh.
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


#### Submitting profile.sh and analyze.sh to the job queue
Now we can submit the profile.sh and analyze.sh to the job queue.

In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi;

#### Plot a bar chart to illustrate the Scalar and Vector ops percentage among different primitives
The FP Arithmetic column contains several sub-columns.  
The section below prints all the sub-column names and plots column 4, which holds the FP vector information.  

In [None]:
import pandas as pd
data = pd.read_csv('fp.csv', engine='python')
if not data.empty:
    if len(data.columns) >= 5:
        # Print every sub-column name together with its index.
        for i, col in enumerate(data.columns):
            print(" column %d : %s " % (i, col))

        # Column 4 holds the FP vector information.
        data.plot.bar(x=data.columns[0], y=data.columns[4], rot=0, title="", ylim=(0,100), figsize=(8,5),fontsize=12);

> NOTE: users should see an FP Vector ratio of only ~40% for the convolution primitive.  
This is suboptimal, and we should try to bring it close to 100%.

#####  2.2 Memory Bound over oneDNN primitives

Third, we want to know whether those primitives suffer from memory problems, and the microarchitecture profiling type provides that data. 
We reuse the microarchitecture result collected in section 2.1 and analyze memory bound issues by using analyze.sh. 

#### Script - analyze.sh
The script **analyze.sh** encapsulates the program for submission to the analyzing job queue for execution.  
In the section below, we extract only the column "Memory Bound" from the VTune™ result.

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue -format csv -csv-delimiter ',' -group-by task -column 'Memory Bound' | head -n 10 > memory.csv
echo "########## Done with the analyzing"


OPTIONAL: replace $ONEAPI_INSTALL with its actual value in analyze.sh.
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


#### Submitting analyze.sh to the job queue
Now we can submit analyze.sh to the job queue. The profiling result from section 2.1 is reused, so profile.sh is not run again.

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

#### Plot bar charts to identify any DRAM/L1/L2/L3 bound issues among different primitives
The Memory Bound column contains several sub-columns.  
Users can print all the sub-column names in the section below by uncommenting the print() loop.  
We pick columns 1, 2, 11, 12 and 17 for the different memory bound information.    

In [None]:
import pandas as pd
data = pd.read_csv('memory.csv', engine='python')
if not data.empty:
    if len(data.columns) >= 18:
        # Uncomment the next two lines to list every sub-column name with its index.
        # for i, col in enumerate(data.columns):
        #     print(" column %d : %s " % (i, col))

        # Columns 1, 2, 11, 12 and 17 hold the memory bound metrics of interest.
        for idx in (1, 2, 11, 12, 17):
            data.plot.bar(x=data.columns[0], y=data.columns[idx], rot=0, title="", ylim=(0,100), figsize=(8,5), fontsize=12)

> NOTE: users should see that the eltwise primitive is more DRAM bound than L1/L2/L3 bound.  
In general, we want to reduce all memory bound issues, but being L1/L2/L3 bound is preferable to being DRAM bound.  
If a primitive is more DRAM bound than L3 bound, most of its data is probably not resident in the L3 cache. 
Ideally, the data stays in the L1/L2/L3 caches. 
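
The rule of thumb in the note can be written down as a small check. This is only a sketch of the comparison described above: the function names are hypothetical, and the input percentages are placeholders that the reader would replace with the values plotted from memory.csv.

```python
def dominant_memory_level(bound):
    """Return the memory level with the highest bound percentage."""
    return max(bound, key=bound.get)


def dram_dominates(bound):
    """True when DRAM bound exceeds every cache-level (L1/L2/L3) bound."""
    worst_cache = max(bound[level] for level in ("L1", "L2", "L3"))
    return bound["DRAM"] > worst_cache


# Placeholder percentages for one primitive, not measured values.
example_bound = {"L1": 5.0, "L2": 2.0, "L3": 8.0, "DRAM": 30.0}

print(dominant_memory_level(example_bound))
print(dram_dominates(example_bound))
```

A primitive for which `dram_dominates` returns `True` matches the situation the note warns about, where most data is fetched from main memory instead of the caches.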

#### 3. Threading Profiling Type

#####  Thread Oversubscription
Finally, we showcase how to identify a thread oversubscription problem in VTune™. The Threading profiling type provides that data.  
We will profile the sample by using profile.sh and then generate a summary output with oversubscription information by using analyze.sh. 

> NOTE: we deliberately create the thread oversubscription problem by setting the OpenMP thread number to a very large value, so that VTune™ reports the oversubscription caused by this wrong setting. 

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
export OMP_NUM_THREADS=200 
vtune -collect threading -data-limit=2000 -q -no-summary -r dnnl-vtune-th ./cpu_gomp/out/performance-profiling-cpp cpu naive
echo "########## Done with the profiling"


#### Script - analyze.sh
The script **analyze.sh** encapsulates the program for submission to the analyzing job queue for execution.  
In the section below, we only generate a summary report.

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report summary -r dnnl-vtune-th --format html -report-output summary.html
echo "########## Done with the analyzing"


OPTIONAL: replace $ONEAPI_INSTALL with its actual value in both profile.sh and analyze.sh.
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


#### Submitting profile.sh and analyze.sh to the job queue
Now we can submit the profile.sh and analyze.sh to the job queue.

In [None]:
!rm -rf dnnl-vtune-th; chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

#### Check the Summary page from Threading Profiling
Please check the summary report below.  
You should see the total thread count, which is the value of OMP_NUM_THREADS.  
You should also see how long this workload suffers from thread oversubscription. 

In [None]:
from IPython.display import IFrame
IFrame(src='summary.html', width=960, height=600)
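
Oversubscription means that more software threads are runnable than the machine has logical CPUs. A quick sanity check before profiling could look like the sketch below. The helper `is_oversubscribed` is hypothetical, and the value 200 mirrors the OMP_NUM_THREADS setting used in profile.sh above.

```python
import os


def is_oversubscribed(requested_threads, logical_cpus):
    """True when more threads are requested than logical CPUs are available."""
    return requested_threads > logical_cpus


# os.cpu_count() can return None on some platforms, hence the fallback to 1.
logical = os.cpu_count() or 1
requested = 200  # same value as OMP_NUM_THREADS in profile.sh

print(f"logical CPUs: {logical}, requested threads: {requested}")
print("oversubscribed:", is_oversubscribed(requested, logical))
```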

###  Profile the performance profiling sample with blocked data format
The blocked format implementation executes the same sequence of operations on the blocked format optimized for convolution performance. This implementation uses format_tag=ANY to create a convolution memory descriptor, which lets oneDNN determine the data format that is optimal for the convolution implementation. It then propagates the blocked format to the non-intensive ReLU. This implementation results in better overall performance than the naive implementation.  
In this section, we identify the performance improvements, including better vectorization, that users gain by changing the data format from naive to blocked.   

#### 1. Hotspots Profiling Type

#####  Top oneDNN primitive hotspots
First, we want to know which primitive takes most of the time.  
We will profile the sample by using profile.sh and then analyze the hotspot by using analyze.sh. 

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect hotspots -q -no-summary -knob sampling-mode=hw -r dnnl-vtune-b ./cpu_gomp/out/performance-profiling-cpp cpu blocked
echo "########## Done with the profiling"


#### Script - analyze.sh
The script **analyze.sh** encapsulates the program for submission to the analyzing job queue for execution.  
In the section below, we extract only the column "CPU Time:Self" from the VTune™ result.

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-b -format csv -csv-delimiter ',' -group-by task -column 'CPU Time:Self' | head -n 10 > hotspot_b.csv
echo "########## Done with the analyzing"


OPTIONAL: replace $ONEAPI_INSTALL with its actual value in both profile.sh and analyze.sh.
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


#### Submitting profile.sh and analyze.sh to the job queue
Now we can submit the profile.sh and analyze.sh to the job queue.

In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

#### Plot a pie chart to illustrate the CPU time percentage among different primitives
The cell also prints the absolute CPU time of each primitive.

In [None]:
import pandas as pd
data = pd.read_csv('hotspot_b.csv', engine='python')
if data.empty is False:
    print(data)
    if len(data.columns) >= 2:                
        data.plot.pie(y=data.columns[1], labels=data.iloc[:,0], figsize=(8, 8), fontsize=20, autopct='%1.1f%%')

#### 2. Microarchitecture Profiling Type

#####  2.1 Vectorization over oneDNN primitives
Second, we want to know how well each primitive is vectorized, and microarchitecture profiling type provides those data. 
We will profile the sample by using profile.sh and then analyze how well it is vectorized by using analyze.sh. 

In [None]:
%%writefile profile.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the profiling"
vtune -collect uarch-exploration -knob sampling-interval=1 -data-limit=2000 -q -no-summary -r dnnl-vtune-ue-b ./cpu_gomp/out/performance-profiling-cpp cpu blocked
echo "########## Done with the profiling"


#### Script - analyze.sh
The script **analyze.sh** encapsulates the program for submission to the analyzing job queue for execution.  
In the section below, we extract only the column "FP Arithmetic" from the VTune™ result.

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ',' -group-by task -column 'FP Arithmetic' | head -n 10 > fp_b.csv
echo "########## Done with the analyzing"


OPTIONAL: replace $ONEAPI_INSTALL with its actual value in both profile.sh and analyze.sh.
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('profile.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


#### Submitting profile.sh and analyze.sh to the job queue
Now we can submit the profile.sh and analyze.sh to the job queue.

In [None]:
!chmod 755 q;chmod 755 profile.sh;if [ -x "$(command -v qsub)" ]; then ./q profile.sh; else ./profile.sh; fi

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

#### Plot a bar chart to illustrate the Scalar and Vector ops percentage among different primitives
The FP Arithmetic column contains several sub-columns.  
We plot only column 4, which holds the FP vector information.  

In [None]:
import pandas as pd
data = pd.read_csv('fp_b.csv', engine='python')
if data.empty is False:
    if len(data.columns) >= 5:            
        data.plot.bar(x=data.columns[0], y=data.columns[4], rot=0, title="", ylim=(0,100), figsize=(8,5),fontsize=12);

> NOTE: users should see a 100% FP Vector ratio for the convolution primitive.  
This is optimal. The only change was switching the data format from naive to blocked, so the data format does help vectorization.
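
To put the two runs side by side, the per-primitive metric from each CSV can be compared directly. The following is a minimal sketch with a hypothetical helper `compare_metric`. The two input values are the approximate FP Vector ratios quoted in the notes of this notebook (~40% naive, 100% blocked), not new measurements.

```python
def compare_metric(naive, blocked):
    """Return {primitive: (naive, blocked, delta)} for primitives in both runs."""
    return {
        name: (naive[name], blocked[name], blocked[name] - naive[name])
        for name in naive
        if name in blocked
    }


# Approximate FP Vector ratios (percent) taken from the notes in this notebook.
naive_fp_vector = {"convolution": 40.0}
blocked_fp_vector = {"convolution": 100.0}

comparison = compare_metric(naive_fp_vector, blocked_fp_vector)
for name, (before, after, delta) in comparison.items():
    print(f"{name}: {before:.0f}% -> {after:.0f}% ({delta:+.0f} points)")
```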

#####  2.2 Memory Bound over oneDNN primitives

Third, we want to know whether those primitives suffer from memory problems, and the microarchitecture profiling type provides that data. 
We reuse the microarchitecture result collected in section 2.1 and analyze memory bound issues by using analyze.sh. 

#### Script - analyze.sh
The script **analyze.sh** encapsulates the program for submission to the analyzing job queue for execution.  
In the section below, we extract only the column "Memory Bound" from the VTune™ result.

In [None]:
%%writefile analyze.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force> /dev/null 2>&1
echo "########## Executing the analyzing"
vtune -report hotspots -q -r dnnl-vtune-ue-b -format csv -csv-delimiter ',' -group-by task -column 'Memory Bound' | head -n 10 > memory_b.csv
echo "########## Done with the analyzing"


OPTIONAL: replace $ONEAPI_INSTALL with its actual value in analyze.sh.
> NOTE: this step is optional in a local environment but mandatory if you run the notebook on DevCloud.

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('analyze.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )


#### Submitting analyze.sh to the job queue
Now we can submit analyze.sh to the job queue. The profiling result from section 2.1 is reused, so profile.sh is not run again.

In [None]:
!chmod 755 q;chmod 755 analyze.sh;if [ -x "$(command -v qsub)" ]; then ./q analyze.sh; else ./analyze.sh; fi

#### Plot bar charts to identify any DRAM/L1/L2/L3 bound issues among different primitives
The Memory Bound column contains several sub-columns.  
Users can print all the sub-column names in the section below by uncommenting the print() line.  
We pick columns 1, 2, 11, 12 and 17 for the different memory bound information.    

In [None]:
import pandas as pd
data = pd.read_csv('memory_b.csv', engine='python')
if not data.empty:
    if len(data.columns) >= 18:
        #print(data.columns)

        # Columns 1, 2, 11, 12 and 17 hold the memory bound metrics of interest.
        for idx in (1, 2, 11, 12, 17):
            data.plot.bar(x=data.columns[0], y=data.columns[idx], rot=0, title="", ylim=(0,100), figsize=(8,5), fontsize=12)

***
# Summary
In this lab, the developer learned how to:

* Use VTune™ Profiler to profile oneDNN samples with different profiling types.
* Use the oneDNN ITT Tagging feature to profile oneDNN samples, where each primitive appears as a Task in a VTune™ result.
* Identify performance bottlenecks at the primitive level with VTune™ profiling.
* Understand the performance impact of different data formats on oneDNN workloads.
