# Profile Intel® oneAPI Deep Neural Network Library (oneDNN) Samples by Using Verbose Mode and JIT Dump Inspection

## Learning Objectives
In this module the developer will:
* Learn how to use Verbose Mode to profile oneDNN samples on CPU & GPU
* Learn how to inspect JIT Dump to profile oneDNN samples on CPU

This module shows the percentage of elapsed time spent in each oneDNN primitive:
<img src="images/cpu.JPG" style="float:left" width=600>


This module also shows the percentage of elapsed time spent in each oneDNN JIT or GPU kernel:
<img src="images/cpu_jit.JPG" style="float:left" width=400>
<img src="images/gpu_kernel.JPG" style="float:right" width=400>

***
# Verbose Mode Exercise



## Prerequisites
***
### Step 1: Prepare the build/run environment
oneDNN has four different configurations inside the Intel oneAPI toolkits. Each configuration is in a different folder under the oneDNN installation path, and each configuration supports a different compiler or threading library.

Set the installation path of your oneAPI toolkit

In [None]:
# ignore all warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
# default path: /opt/intel/oneapi
%env ONEAPI_INSTALL=/opt/intel/oneapi

In [None]:
import os
if not os.path.isdir(os.environ['ONEAPI_INSTALL']):
    print("ERROR! wrong oneAPI installation path")

In [None]:
!printf '%s\n'     $ONEAPI_INSTALL/dnnl/latest/cpu_*

As you can see, there are four different folders under the oneDNN installation path, and each of those configurations supports different features. This tutorial will use the dpcpp configuration to showcase the verbose log for both CPU and GPU.
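
The same listing can be done from Python, which is handy if later cells need the folder names. This is a minimal sketch that assumes the `dnnl/latest/cpu_*` layout used by the shell command above; the helper name `list_onednn_configs` is hypothetical and not part of the notebook's `profiling` package.

```python
import os
from pathlib import Path


def list_onednn_configs(install_root):
    """Return the sorted names of the cpu_* configuration folders."""
    # Assumed layout, mirroring the shell glob used in this notebook:
    #   <install_root>/dnnl/latest/cpu_*
    base = Path(install_root) / "dnnl" / "latest"
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.glob("cpu_*") if p.is_dir())


if __name__ == "__main__":
    # Fall back to the default path when ONEAPI_INSTALL is not set.
    root = os.environ.get("ONEAPI_INSTALL", "/opt/intel/oneapi")
    for name in list_onednn_configs(root):
        print(name)
```

An empty result means the path does not contain a oneDNN installation in the assumed layout.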

Create a lab folder for this exercise.

In [None]:
!rm -rf lab;mkdir -p lab

Install required python packages.

In [None]:
!pip3 install -r requirements.txt

Get current platform information for this exercise.

In [None]:
from profiling.profile_utils import PlatformUtils
plat_utils = PlatformUtils()
plat_utils.dump_platform_info()

###  Step 2: Prepare the sample code

This exercise uses the cnn_inference_f32.cpp example from the oneDNN installation path.

The section below will copy the cnn_inference_f32.cpp file into the lab folder.  
This section also copies the required header files and CMake file into the lab folder.

In [None]:
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/cnn_inference_f32.cpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/example_utils.hpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/example_utils.h lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_dpcpp_gpu_dpcpp/examples/CMakeLists.txt lab/

### Step 3: Build and Run with the oneAPI DPC++ Compiler 
One of the oneDNN configurations supports the oneAPI DPC++ compiler, and it can run on different architectures by using DPC++.
The following section shows you how to build with DPC++ and run on different architectures.

#### Script - build.sh
The script **build.sh** encapsulates the compiler **dpcpp** command and flags that will generate the executable.
To enable use of the DPC++ compiler and the related SYCL runtime, some definitions must be passed as cmake arguments.
Here are the related cmake arguments for the DPC++ configuration: 

   -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=dpcpp -DDNNL_CPU_RUNTIME=SYCL -DDNNL_GPU_RUNTIME=SYCL

In [None]:
%%writefile build.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --force> /dev/null 2>&1
export EXAMPLE_ROOT=./lab/
mkdir dpcpp
cd dpcpp
cmake .. -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=dpcpp -DDNNL_CPU_RUNTIME=SYCL -DDNNL_GPU_RUNTIME=SYCL
make cnn-inference-f32-cpp 


Once the build completes without errors, you can run your program on the DevCloud or a local machine.

#### Script - run.sh
The script **run.sh** encapsulates the program for submission to the job queue for execution.
By default, the built program uses CPU as the execution engine, but the user can switch to GPU by specifying the input argument "gpu".
The user can refer to run.sh below to run cnn-inference-f32-cpp on both CPU and GPU.

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --force > /dev/null 2>&1
echo "########## Executing the run"
# verbose log is disabled (0) for this first run
export DNNL_VERBOSE=0
./dpcpp/out/cnn-inference-f32-cpp cpu
./dpcpp/out/cnn-inference-f32-cpp gpu
echo "########## Done with the run"



#### OPTIONAL : replace $ONEAPI_INSTALL with set value in both build.sh and run.sh
> NOTE : this step is mandatory if you run the notebook on DevCloud

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('build.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )
file_utils.replace_string_in_file('run.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )



#### Submitting **build.sh** and **run.sh** to the job queue
Now we can submit **build.sh** and **run.sh** to the job queue.
##### NOTE - it is possible to execute any of the build and run commands in local environments.
To let users run their scripts either on the Intel DevCloud or in a local environment, this and subsequent training modules check for the existence of the job submission command **qsub**.  If the check fails, the build and run are assumed to be local.
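
The check described above can also be expressed in Python. The sketch below is illustrative only: it assumes the `./q` wrapper script that this notebook invokes, and the function names `has_qsub` and `build_command` are hypothetical.

```python
import shutil


def has_qsub():
    """True when an executable named qsub is found on PATH."""
    return shutil.which("qsub") is not None


def build_command(script, qsub_available):
    """Pick the command used to launch `script`."""
    # `./q` is the job-submission wrapper used by this notebook;
    # it is only chosen when qsub is present on the machine.
    if qsub_available:
        return ["./q", script]
    return ["./" + script]


if __name__ == "__main__":
    for script in ("build.sh", "run.sh"):
        print(build_command(script, has_qsub()))
```

Keeping the decision in a function that takes the availability flag as an argument makes the choice easy to check without a job queue installed.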

In [None]:
! rm -rf dpcpp;chmod 755 q; chmod 755 build.sh; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q build.sh; ./q run.sh; else ./build.sh; ./run.sh; fi

  
## Enable Verbose Mode
***
In this section, we enable verbose mode on the built sample from the previous section, and users can see different results from CPU and GPU.  
Refer to the [verbose mode guide](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html) for detailed verbose mode information.

When the feature is enabled at build-time, you can use the DNNL_VERBOSE environment variable to turn verbose mode on and control the level of verbosity.

|Environment variable|Value|Description|
|:-----|:----|:-----|
|DNNL_VERBOSE| 0 |no verbose output (default)|
||1|primitive information at execution|
||2|primitive information at creation and execution|


Prepare run.sh and set DNNL_VERBOSE to 2.

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --force > /dev/null 2>&1
echo "########## Executing the run"
# enable verbose log
export DNNL_VERBOSE=2 
./dpcpp/out/cnn-inference-f32-cpp cpu >>log_cpu_f32_vb2.csv 2>&1
./dpcpp/out/cnn-inference-f32-cpp gpu >>log_gpu_f32_vb2.csv 2>&1

echo "########## Done with the run"



#### OPTIONAL : replace $ONEAPI_INSTALL with set value in run.sh
> NOTE : this step is mandatory if you run the notebook on DevCloud

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('run.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )



#### Submitting **run.sh** to the job queue
Now we can submit **run.sh** to the job queue.
##### NOTE - it is possible to execute any of the build and run commands in local environments.
To let users run their scripts either on the Intel DevCloud or in a local environment, this and subsequent training modules check for the existence of the job submission command **qsub**.  If the check fails, the build and run are assumed to be local.

In [None]:
! chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

## Analyze Verbose Logs
***


### Step 1: List out all oneDNN verbose logs
Users should see the two verbose logs listed in the table below.

|Log File Name | Description |
|:-----|:----|
|log_cpu_f32_vb2.csv| log for cpu run |
|log_gpu_f32_vb2.csv| log for gpu run|

In [None]:
import os
filenames = os.listdir(".")
result = []
keyword = ".csv"
# collect every file whose name contains the keyword
for filename in filenames:
    if keyword in filename:
        result.append(filename)
result.sort()

# print an index next to each log so it can be selected in Step 2
for index, filename in enumerate(result):
    print(" %d : %s " % (index, filename))

### Step 2:  Pick a verbose log by putting its index value below
Users can pick either cpu or gpu log for analysis.   
Once users finish Step 2 to Step 8 for one log file, they can go back to Step 2 and select another log file for analysis.

In [None]:
FdIndex=0

#### OPTIONAL: browse the content of selected verbose log.

In [None]:
logfile = result[FdIndex]
with open(logfile) as f:
    log = f.read()
print(log)

### Step 3: Parse verbose log and get the data back

In [None]:
logfile = result[FdIndex]
print(logfile)
from profiling.profile_utils import oneDNNUtils, oneDNNLog
onednn = oneDNNUtils()
log1 = oneDNNLog()
log1.load_log(logfile)
data = log1.data
exec_data = log1.exec_data


### Step 4: Time breakdown for exec type
The exec type includes exec and create. 

|exec type | Description |  
|:-----|:----|  
|exec | Time spent executing primitives. Ideally, most of the time is spent here. |  
|create| Time spent creating primitives. Primitive creation happens once, so less time spent here is better. |  

In [None]:
onednn.breakdown(data,"exec","time")

### Step 5: Time breakdown for primitives type
The primitives type includes convolution, reorder, sum, etc.  
For this simple convolution net example, the convolution and inner product primitives are expected to take most of the time.  
However, the exact time percentage of each primitive may vary among architectures.    
Users can easily identify the top hotspots among primitive executions with this time breakdown.  
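
To make the idea of a time breakdown concrete, here is a minimal sketch of the underlying calculation: sum the time per category and divide by the total. It is not the implementation of `onednn.breakdown`; the record layout (dictionaries with `type` and `time` keys) and the timings are made up for illustration.

```python
from collections import defaultdict


def time_share_by_key(records, key):
    """Return {value: percent of total time} for `key`, largest share first."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["time"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {name: 100.0 * t / grand_total for name, t in ranked}


# Made-up timings in milliseconds, for illustration only.
sample = [
    {"type": "convolution", "time": 6.0},
    {"type": "inner_product", "time": 3.0},
    {"type": "reorder", "time": 0.5},
    {"type": "convolution", "time": 0.5},
]

for name, pct in time_share_by_key(sample, "type").items():
    print(f"{name:15s} {pct:5.1f}%")
```

The same function works for any other column, for example a kernel or algorithm field, as long as each record carries that key.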

In [None]:
onednn.breakdown(exec_data,"type","time")

### Step 6:  Time breakdown for JIT kernel type

oneDNN uses just-in-time compilation (JIT) to generate optimal code for some functions based on the input parameters and the instruction set supported by the system.   
Therefore, users can see different JIT kernel types on different CPU and GPU architectures.  
For example, users can see the avx512_core_vnni JIT kernel if the workload uses VNNI instructions on a Cascade Lake platform.  
Users can also see different OCL kernels on different Intel GPU generations.  
Moreover, users can identify the top hotspots among JIT kernel executions with this time breakdown.  
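
The kernel type comes from the implementation field of each verbose line. The sketch below pulls that field out of a single `exec` line. The column order is an assumption based on the format described in the verbose mode guide and differs between oneDNN releases; the sample line and the function name `parse_exec_line` are illustrative and are not taken from an actual run or from the `profiling` package.

```python
def parse_exec_line(line):
    """Split one verbose line into a few named fields, or return None."""
    fields = line.strip().split(",")
    # Assumed column order (it differs between oneDNN releases):
    #   marker, operation, engine, primitive, implementation, ..., time
    if len(fields) < 6 or fields[0] != "dnnl_verbose" or fields[1] != "exec":
        return None
    try:
        time_ms = float(fields[-1])
    except ValueError:
        return None
    return {
        "engine": fields[2],
        "primitive": fields[3],
        "kernel": fields[4],
        "time_ms": time_ms,
    }


# Illustrative line only; real logs will contain different values.
sample_line = (
    "dnnl_verbose,exec,cpu,convolution,jit:avx2,forward_inference,"
    "src_f32::blocked:aBcd8b:f0 dst_f32::blocked:aBcd8b:f0,,"
    "alg:convolution_direct,mb1_ic3oc96_ih227oh55kh11sh4dh0ph0,1.25"
)

print(parse_exec_line(sample_line))
```

Lines that are not `exec` records, such as the informational header lines, are skipped by returning `None`.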


In [None]:
onednn.breakdown(exec_data,"jit","time")

### Step 7:  Time breakdown for algorithm type
oneDNN also supports different algorithms.  
Users can identify the top hotspots among algorithm executions with this time breakdown.  

In [None]:
onednn.breakdown(exec_data,"alg","time")

### Step 8: Time breakdown for architecture type
The supported architectures include CPU and GPU.  
For this simple net sample, we don't split computation between CPU and GPU,    
so users should see either 100% CPU time or 100% GPU time. 

In [None]:
onednn.breakdown(data,"arch","time")

***
## Inspecting JIT Code

In this section, we dump the JIT code generated for the sample built in the previous section. The dump is inspected for the CPU run only.    
Refer to the [JIT inspection guide](https://oneapi-src.github.io/oneDNN/dev_guide_inspecting_jit.html) for detailed JIT Dump information.

When the feature is enabled at build-time, you can use the DNNL_JIT_DUMP environment variable to inspect JIT code.

|Environment variable|Value|Description|
|:-----|:----|:-----|
|DNNL_JIT_DUMP | 0 |JIT dump is disabled (default)|
||any other value|JIT dump is enabled|



#### Step 1: Prepare run.sh and set DNNL_JIT_DUMP to 1

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --force > /dev/null 2>&1
echo "########## Executing the run"
# disable verbose log
export DNNL_VERBOSE=0
# enable JIT Dump
export DNNL_JIT_DUMP=1 
./dpcpp/out/cnn-inference-f32-cpp cpu
echo "########## Done with the run"



#### OPTIONAL : replace $ONEAPI_INSTALL with set value in run.sh
> NOTE : this step is mandatory if you run the notebook on DevCloud

In [None]:
from profiling.profile_utils import FileUtils
file_utils = FileUtils()
file_utils.replace_string_in_file('run.sh','$ONEAPI_INSTALL', os.environ['ONEAPI_INSTALL'] )



#### Step 2: Submitting **run.sh** to the job queue
Now we can submit **run.sh** to the job queue.

In [None]:
! chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

#### Step 3: Move all JIT Dump files into the jitdump folder

In [None]:
!mkdir -p jitdump;mv *.bin jitdump
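
For readers who prefer to stay in Python, the move can be sketched with `pathlib` and `shutil`. Like the shell command, this assumes that every `*.bin` file in the source directory is a JIT dump; the function name `collect_jit_dumps` is hypothetical.

```python
import shutil
from pathlib import Path


def collect_jit_dumps(src_dir=".", dest_dir="jitdump"):
    """Move every *.bin file from src_dir into dest_dir; return the moved names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    moved = []
    for path in sorted(Path(src_dir).glob("*.bin")):
        if path.is_file():
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved


if __name__ == "__main__":
    for name in collect_jit_dumps():
        print("moved", name)
```

Calling it a second time is harmless: the destination folder is reused and nothing is left to move.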

#### Step 4: List out all oneDNN JIT Dump files

In [None]:
import os
filenames = os.listdir("jitdump")
result = []
keyword = ".bin"
# collect every file whose name contains the keyword
for filename in filenames:
    if keyword in filename:
        result.append(filename)
result.sort()

# print an index next to each dump file so it can be selected in Step 5
for index, filename in enumerate(result):
    print(" %d : %s " % (index, filename))

#### Step 5: Pick a JIT Dump file by putting its index value below

In [None]:
FdIndex=0

#### Step 6: Export the JIT Dump file name to the environment variable JITFILE

In [None]:
logfile = result[FdIndex]
os.environ["JITFILE"] = logfile

#### Step 7: Disassemble the JIT Dump file to view the code

> NOTE: If the oneDNN sample uses VNNI instructions, users should see the "vpdpbusd" instruction in the JIT Dump file.  

> NOTE: If the oneDNN sample uses BF16 instructions, users should see vdpbf16ps or vcvtne2ps2bf16 in the JIT Dump file.  


> NOTE: To disassemble the vdpbf16ps and vcvtne2ps2bf16 instructions, users must use objdump v2.34 or above.

In [None]:
!objdump -D -b binary -mi386:x86-64 jitdump/$JITFILE
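
Scanning a long disassembly by eye for the instructions named in the notes above is tedious, so a small script can count them. This is a sketch under the assumption that `objdump` is on PATH; the helper names `disassemble` and `count_mnemonics` are hypothetical, and the sample text is illustrative rather than output from a real dump.

```python
import re
import subprocess


def disassemble(path):
    """Run objdump with the flags used in this notebook and return its text."""
    cmd = ["objdump", "-D", "-b", "binary", "-mi386:x86-64", path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def count_mnemonics(disassembly, mnemonics):
    """Count whole-word occurrences of each mnemonic in objdump output."""
    counts = {}
    for name in mnemonics:
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, disassembly))
    return counts


# Illustrative disassembly text, not taken from a real JIT dump.
sample_text = (
    "   0:\t62 f2 75 48 50 c2    \tvpdpbusd %zmm2,%zmm1,%zmm0\n"
    "   6:\tc3                   \tret\n"
)

print(count_mnemonics(sample_text, ["vpdpbusd", "vdpbf16ps", "vcvtne2ps2bf16"]))
```

On a real file, `count_mnemonics(disassemble("jitdump/" + logfile), [...])` would report how often each instruction appears; a count of zero for the BF16 mnemonics may also mean that the installed objdump is too old to decode them.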

***
# Summary
In this lab the developer learned the following:
* how to use Verbose Mode to profile different oneDNN samples on CPU and GPU
* how to inspect JIT Dump to profile oneDNN samples on CPU
