# Analyze ISA usage with Intel® oneAPI Deep Neural Network Library (oneDNN) Samples by using CPU Dispatcher Control

## Learning Objectives
In this module the developer will:
* Learn how to use CPU Dispatcher Control to generate JIT code for different Instruction Set Architectures (ISAs) on CPU
* Analyze how JIT kernel and CPU instruction usage differs between ISAs
    - AVX512 vs AVX2
    - AVX512 VNNI vs AVX512
    - AVX512 BF16 vs AVX512 (Optional, no hardware support in DevCloud now.)


This module also shows the elapsed time percentage over different oneDNN JIT kernels, so users can also see the usage of specific JIT Kernels for VNNI or BF16 instructions.

<img src="images/vnni.JPG" style="float:left" width=400>
<img src="images/bf16.JPG" style="float:right" width=400>


# CPU Dispatch Control and ISA Analysis Exercise



## Prerequisites
****
### Step 1: Prepare the build/run environment
oneDNN has four different configurations inside the Intel oneAPI toolkits. Each configuration is in a different folder under the oneDNN installation path, and each configuration supports a different compiler or threading library.

Set the installation path of your oneAPI toolkit.

In [None]:
# default path: /opt/intel/oneapi
%env ONEAPI_INSTALL=/opt/intel/oneapi

In [None]:
!printf '%s\n'     $ONEAPI_INSTALL/dnnl/latest/cpu_*

As you can see, there are four different folders under the oneDNN installation path, and each of those configurations supports different features. This tutorial will use the cpu_gomp configuration to do ISA analysis on CPU.

Create a lab folder for this exercise.

In [None]:
!mkdir -p lab

Install required python packages.

In [None]:
!pip install -r requirements.txt

Get current platform information for this exercise.

In [None]:
from profiling.profile_utils import PlatformUtils
plat_utils = PlatformUtils()
plat_utils.dump_platform_info()

### Step 2: Prepare the sample code

This exercise uses the cnn_inference_f32.cpp, cnn_inference_int8.cpp, and cpu_cnn_training_bf16.cpp examples from the oneDNN installation path.

The section below copies these three source files into the lab folder.  
It also copies the required header files and the CMake file into the lab folder.

In [None]:
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_gomp/examples/cnn_inference_f32.cpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_gomp/examples/cnn_inference_int8.cpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_gomp/examples/cpu_cnn_training_bf16.cpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_gomp/examples/example_utils.hpp lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_gomp/examples/example_utils.h lab/
!cp $ONEAPI_INSTALL/dnnl/latest/cpu_gomp/examples/CMakeLists.txt lab/

### Step 3: Build and Run with GNU Compiler and OpenMP 
One of the oneDNN configurations supports the GNU Compiler.
The following section shows you how to build with the GNU Compiler and run on CPU.

#### Script - build.sh
The script **build.sh** encapsulates the cmake and make commands, configured to use the **g++** compiler, that generate the executables.
To use the GNU compiler and the related OpenMP runtime, some definitions must be passed as cmake arguments.
Here are the related cmake arguments for the cpu_gomp configuration: 

   -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DDNNL_CPU_RUNTIME=OMP -DDNNL_GPU_RUNTIME=NONE

In [None]:
%%writefile build.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force > /dev/null 2>&1
export EXAMPLE_ROOT=./lab/
mkdir -p cpu_gomp
cd cpu_gomp
cmake .. -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DDNNL_CPU_RUNTIME=OMP -DDNNL_GPU_RUNTIME=NONE
make



Once the build completes without errors, you can run the program on the DevCloud or on a local machine.

#### Script - run.sh
The script **run.sh** encapsulates the program for submission to the job queue for execution.
The user can refer to run.sh below to run the three built samples (cnn-inference-f32-cpp, cnn-inference-int8-cpp, and cpu-cnn-training-bf16-cpp) on CPU.

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force > /dev/null 2>&1
echo "########## Executing the run"
# verbose log is disabled for this first run (set to 1 or 2 to enable it)
export DNNL_VERBOSE=0
./cpu_gomp/out/cnn-inference-f32-cpp
./cpu_gomp/out/cnn-inference-int8-cpp
./cpu_gomp/out/cpu-cnn-training-bf16-cpp
echo "########## Done with the run"



#### Submitting **build.sh** and **run.sh** to the job queue
Now we can submit **build.sh** and **run.sh** to the job queue.
##### NOTE - it is possible to execute any of the build and run commands in local environments.
To let users run their scripts either on the Intel DevCloud or in a local environment, this and subsequent cells check for the existence of the job submission command **qsub**. If the check fails, the build and run are assumed to be local.

In [None]:
! rm -rf dpcpp;chmod 755 q; chmod 755 build.sh; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q build.sh; ./q run.sh; else ./build.sh; ./run.sh; fi

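The same check can be expressed in Python. The sketch below is only an illustration: `launch_command` is a hypothetical helper, not part of this module, and it assumes the `q` wrapper script sits in the current directory as in the cell above.

```python
import shutil

def launch_command(script, which=shutil.which):
    """Return the command list used to run `script`.

    If an executable qsub is found on PATH, go through the ./q wrapper;
    otherwise run the script locally.
    """
    if which("qsub") is not None:
        return ["./q", script]
    return ["./" + script]

# Shows which launcher would be picked on the current machine.
print(launch_command("run.sh"))
```

The `which` parameter exists only so the decision can be exercised without a real job queue.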


## Run Time CPU Dispatcher Controls
***
In this section, we run workloads on the latest Xeon server from DevCloud and use CPU dispatcher controls to generate JIT kernels for different ISAs for comparison.
Users can understand how the different ISAs are used by analyzing oneDNN verbose logs and JIT dump files.
Refer to the [CPU Dispatcher Control documentation](https://oneapi-src.github.io/oneDNN/dev_guide_cpu_dispatcher_control.html) for details.

When the feature is enabled at build time, you can use the DNNL_MAX_CPU_ISA environment variable to limit the processor features that oneDNN is able to detect to a certain Instruction Set Architecture (ISA) and older instruction sets. The variable can also be used to enable ISAs with initial support in the library that are otherwise disabled by default.

|Environment variable value|Description|Introduced with microarchitecture|
|:----|:-----|:-----|
|SSE41|Intel Streaming SIMD Extensions 4.1 (Intel SSE4.1)| Penryn |
|AVX|Intel Advanced Vector Extensions (Intel AVX)|Sandy Bridge |
|AVX2|Intel Advanced Vector Extensions 2 (Intel AVX2)| Haswell |
|AVX512_CORE|Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions| Skylake-X |
|AVX512_CORE_VNNI|Intel AVX-512 with Intel Deep Learning Boost (Intel DL Boost)| Cascade Lake |
|AVX512_CORE_BF16|Intel AVX-512 with Intel DL Boost and bfloat16 support| Cooper Lake |
|ALL|No restrictions on the above ISAs, but excludes the below ISAs with initial support in the library (default)| |
|AVX512_CORE_AMX|Intel AVX-512 with Intel DL Boost and bfloat16 support and Intel Advanced Matrix Extensions (Intel AMX) with 8-bit integer and bfloat16 support (initial support) | |

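As a rough illustration of how the cap works, the sketch below treats the first six table rows as one ordered chain and passes DNNL_MAX_CPU_ISA to a child process only. The helper names `isas_up_to` and `run_with_max_isa` are hypothetical, the linear ordering is a simplification made for this example, and the child process is a stand-in that echoes the variable rather than a oneDNN sample.

```python
import os
import subprocess
import sys

# ISA values from the table above, oldest first. Treating them as one
# linear chain is a simplification made for this sketch.
ISA_ORDER = ["SSE41", "AVX", "AVX2", "AVX512_CORE",
             "AVX512_CORE_VNNI", "AVX512_CORE_BF16"]

def isas_up_to(max_isa):
    """Return the ISAs that remain available under the given cap."""
    return ISA_ORDER[:ISA_ORDER.index(max_isa) + 1]

def run_with_max_isa(cmd, max_isa):
    """Run cmd with DNNL_MAX_CPU_ISA set for the child process only."""
    env = dict(os.environ, DNNL_MAX_CPU_ISA=max_isa)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

print(isas_up_to("AVX2"))

# Stand-in child process that just echoes the variable it received.
child = [sys.executable, "-c",
         "import os; print(os.environ['DNNL_MAX_CPU_ISA'])"]
print(run_with_max_isa(child, "AVX512_CORE").stdout.strip())
```

Setting the variable per process, rather than exporting it globally, keeps the two runs of a comparison from affecting each other.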



## ISA Comparison
***

The section below compares and analyzes JIT kernel usage and CPU instruction usage across different ISAs.

The table below lists the comparisons, the oneDNN sample used for each,   
and the key point of each comparison. 

|ISA Comparison | oneDNN sample | Description | 
|:----|:-----|:-----|
|AVX512 vs AVX2 |cnn-inference-f32-cpp| Shows the usage of zmm registers and the AVX512 JIT kernel | 
|AVX512 VNNI vs AVX512 |cnn-inference-int8-cpp| Shows the usage of VNNI instructions and the VNNI JIT kernel|
|AVX512 BF16 vs AVX512| cpu-cnn-training-bf16-cpp| Shows the usage of BF16 instructions and the BF16 JIT kernel| 

Those comparisons can be conducted on the same CPU microarchitecture with the help of oneDNN CPU dispatcher control.  
Users can also conduct similar comparisons for TensorFlow or PyTorch workloads by replacing the oneDNN sample with other workloads.  
By conducting similar comparisons of real workloads, users can understand:  
* Whether the workloads leverage the latest instructions like VNNI on the platform
* How much performance benefit is gained by using the latest instruction on the same platform


### Step 1: Pick one of the ISA comparisons
After users pick an ISA comparison, the related environment variables are exported.  
  
The section below lists all ISA comparison options with an index number.

In [None]:
ISA_COMPARISON_LIST=["avx512_avx2","avx512-vnni_avx512","avx512-bf16_avx512"]
index =0 
for ISA_C in ISA_COMPARISON_LIST:
    print(" %d : %s " %(index, ISA_C))
    index+=1

Please select a comparison option and assign its index to the ISAIndex variable.
> NOTE: DevCloud currently has no BF16 hardware support. Please **IGNORE** the **avx512-bf16_avx512** comparison.

In [None]:
ISAIndex=0

The section below will export related environment variables according to the selected ISA comparison.

In [None]:
ISA_COMPARISON = ISA_COMPARISON_LIST[ISAIndex]
print(" Compare between ", ISA_COMPARISON)
import os
if ISA_COMPARISON == "avx512_avx2":
    # variables for AVX2
    os.environ["DNNL_MAX_CPU_ISA_VAL1"] = "AVX2"
    os.environ["DNNL_APP_VAL1"] = "cnn-inference-f32-cpp"
    os.environ["DNNL_LOG_VAL1"] = "log_cpu_f32_avx2.csv"
    os.environ["DNNL_JIT_FD_VAL1"] = "jitdump_f32_avx2"
    # variables for AVX512
    os.environ["DNNL_MAX_CPU_ISA_VAL2"] = "AVX512_CORE"
    os.environ["DNNL_APP_VAL2"] = "cnn-inference-f32-cpp"
    os.environ["DNNL_LOG_VAL2"] = "log_cpu_f32_avx512.csv"
    os.environ["DNNL_JIT_FD_VAL2"] = "jitdump_f32_avx512"
    # AVX512 specific register
    os.environ["DNNL_ISA_KEYWORD"] = "zmm"
    
elif ISA_COMPARISON == "avx512-vnni_avx512":
    # variables for AVX512
    os.environ["DNNL_MAX_CPU_ISA_VAL1"] = "AVX512_CORE"
    os.environ["DNNL_APP_VAL1"] = "cnn-inference-int8-cpp"
    os.environ["DNNL_LOG_VAL1"] = "log_cpu_int8_avx512.csv"
    os.environ["DNNL_JIT_FD_VAL1"] = "jitdump_int8_avx512"
    # variables for AVX512 VNNI
    os.environ["DNNL_MAX_CPU_ISA_VAL2"] = "AVX512_CORE_VNNI"
    os.environ["DNNL_APP_VAL2"] = "cnn-inference-int8-cpp"
    os.environ["DNNL_LOG_VAL2"] = "log_cpu_int8_avx512_vnni.csv"
    os.environ["DNNL_JIT_FD_VAL2"] = "jitdump_int8_avx512_vnni"
    # VNNI specific instruction
    os.environ["DNNL_ISA_KEYWORD"] = "vpdpbusd"   
    
elif ISA_COMPARISON == "avx512-bf16_avx512":
    # variables for AVX512
    os.environ["DNNL_MAX_CPU_ISA_VAL1"] = "AVX512_CORE"
    os.environ["DNNL_APP_VAL1"] = "cpu-cnn-training-bf16-cpp"
    os.environ["DNNL_LOG_VAL1"] = "log_cpu_bf16_avx512.csv"
    os.environ["DNNL_JIT_FD_VAL1"] = "jitdump_bf16_avx512"
    # variables for AVX512 BF16
    os.environ["DNNL_MAX_CPU_ISA_VAL2"] = "AVX512_CORE_BF16"
    os.environ["DNNL_APP_VAL2"] = "cpu-cnn-training-bf16-cpp"
    os.environ["DNNL_LOG_VAL2"] = "log_cpu_bf16_avx512_bf16.csv"
    os.environ["DNNL_JIT_FD_VAL2"] = "jitdump_bf16_avx512_bf16"
    # BF16 specific instructions
    os.environ["DNNL_ISA_KEYWORD"] = "vdpbf16ps|vcvtne2ps2bf16"        

### Step 2: Script - run.sh for the first selected ISA (e.g. AVX2 or AVX512_CORE)
****
The script **run.sh** encapsulates the program for submission to the job queue for execution.
The user can refer to run.sh below to run the oneDNN sample on CPU with the selected ISA.

  
Print out the selected ISA.

In [None]:
! echo $DNNL_MAX_CPU_ISA_VAL1

Prepare run.sh and use DNNL_MAX_CPU_ISA to run the sample on the selected ISA.

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force  > /dev/null 2>&1
echo "########## Executing the run"
# enable verbose log
export DNNL_VERBOSE=2 
# enable JIT Dump
export DNNL_JIT_DUMP=1

DNNL_MAX_CPU_ISA=$DNNL_MAX_CPU_ISA_VAL1 ./cpu_gomp/out/$DNNL_APP_VAL1 cpu >> $DNNL_LOG_VAL1 2>&1

echo "########## Done with the run"



#### Submitting  **run.sh** to the job queue
> NOTE: By assigning clx to property, users can execute the sample on a Cascade Lake platform from Intel DevCloud.

In [None]:
! export property=clx; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

#### Gather all JIT bin files into a folder

In [None]:
! rm -rf $DNNL_JIT_FD_VAL1; mkdir $DNNL_JIT_FD_VAL1; mv *.bin $DNNL_JIT_FD_VAL1

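For readers who prefer to stay in Python, the sketch below does the same gathering step as the shell cell above. `gather_bins` is a hypothetical helper, and the file names in the demo are placeholders rather than real JIT dump names.

```python
import shutil
import tempfile
from pathlib import Path

def gather_bins(src_dir, dest_dir):
    """Move every *.bin file in src_dir into a freshly created dest_dir."""
    src, dest = Path(src_dir), Path(dest_dir)
    shutil.rmtree(dest, ignore_errors=True)   # same effect as rm -rf
    dest.mkdir(parents=True)
    moved = []
    for path in sorted(src.glob("*.bin")):
        shutil.move(str(path), str(dest / path.name))
        moved.append(path.name)
    return moved

# Demo on a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("kernel_a.bin", "kernel_b.bin", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    print(gather_bins(tmp, Path(tmp) / "jitdump_demo"))
```

Unlike `mv *.bin`, this version returns an empty list instead of failing when no dump files are present.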
### Step 3: Script - run.sh for the second selected ISA (e.g. AVX512_CORE_VNNI or AVX512_CORE_BF16)
**** 
The script **run.sh** encapsulates the program for submission to the job queue for execution.
The user can refer to run.sh below to run the oneDNN sample on CPU with the selected ISA.

  
Print out the selected ISA.

In [None]:
! echo $DNNL_MAX_CPU_ISA_VAL2

Prepare run.sh and use DNNL_MAX_CPU_ISA to run the sample on the selected ISA.

In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --dnnl-configuration=cpu_gomp --force  > /dev/null 2>&1
echo "########## Executing the run"
# enable verbose log
export DNNL_VERBOSE=2 
# enable JIT Dump
export DNNL_JIT_DUMP=1

DNNL_MAX_CPU_ISA=$DNNL_MAX_CPU_ISA_VAL2 ./cpu_gomp/out/$DNNL_APP_VAL2 cpu >> $DNNL_LOG_VAL2 2>&1

echo "########## Done with the run"



#### Submitting  **run.sh** to the job queue
> NOTE: By assigning clx to property, users can execute the sample on a Cascade Lake platform from Intel DevCloud.


In [None]:
! export property=clx; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

#### Gather all JIT bin files into a folder

In [None]:
! rm -rf $DNNL_JIT_FD_VAL2; mkdir $DNNL_JIT_FD_VAL2; mv *.bin $DNNL_JIT_FD_VAL2

****
### Step 4: oneDNN Verbose Log JIT Kernel Time BreakDown
oneDNN uses just-in-time (JIT) compilation to generate optimal code for some functions based on the input parameters and the instruction set supported by the system.   
Therefore, users can see different JIT kernel types for the first and second selected ISAs.   
For example, users can see the avx512_core_vnni JIT kernel if the workload uses VNNI instructions on a Cascade Lake platform.  
Moreover, users can identify the top hotspots of JIT kernel execution with this time breakdown. 

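To make the idea of a time breakdown concrete, the sketch below sums execution time per implementation name from verbose lines. It is not the implementation used by `profiling.profile_utils`. It assumes each `exec` line has the layout `...,exec,<engine>,<primitive>,<implementation>,...,<time in ms>`, which may differ between oneDNN versions, and the sample lines are hand-written placeholders, not real measurements.

```python
from collections import defaultdict

def jit_time_breakdown(lines):
    """Return the share of execution time per implementation, in percent."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.strip().split(",")
        if "exec" not in fields:
            continue                       # skip info and create lines
        i = fields.index("exec")
        impl = fields[i + 3]               # e.g. jit:avx512_core
        totals[impl] += float(fields[-1])  # last field: time in ms
    grand = sum(totals.values())
    if grand == 0:
        return {}
    return {impl: 100.0 * t / grand for impl, t in totals.items()}

# Hand-written lines in the assumed layout; not real measurements.
sample = [
    "dnnl_verbose,info,cpu,runtime:OpenMP",
    "dnnl_verbose,exec,cpu,convolution,jit:avx512_core,forward_inference,,,,3.0",
    "dnnl_verbose,exec,cpu,eltwise,jit:avx512_core,forward_inference,,,,1.0",
    "dnnl_verbose,exec,cpu,reorder,simple:any,undef,,,,1.0",
]
for impl, pct in jit_time_breakdown(sample).items():
    print(f"{impl}: {pct:.1f}%")
```

Locating the fields relative to the `exec` marker, rather than by absolute position, keeps the sketch tolerant of extra leading fields.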
#### Parse verbose log and get the data back

In [None]:
from profiling.profile_utils import oneDNNUtils, oneDNNLog
onednn = oneDNNUtils()

logfile1 = os.environ["DNNL_LOG_VAL1"]
log1 = oneDNNLog()
log1.load_log(logfile1)
exec_data1 = log1.exec_data

logfile2 = os.environ["DNNL_LOG_VAL2"]
log2 = oneDNNLog()
log2.load_log(logfile2)
exec_data2 = log2.exec_data


####   JIT Kernel Type Time breakdown for first selected ISA  


In [None]:
onednn.breakdown(exec_data1,"jit","time")

####   JIT Kernel Type Time breakdown for second selected ISA


> NOTE: users should be able to see the **avx512_core_vnni** JIT Kernel if the sample runs with **VNNI** instructions  
> NOTE: users should be able to see the **avx512_core_bf16** JIT Kernel if the sample runs with **BF16** instructions  
> NOTE: users should be able to see the **avx512** JIT Kernel if the sample runs with **AVX512** instructions  

In [None]:
onednn.breakdown(exec_data2,"jit","time")

####   Primitives Type Speedup from second selected ISA
The oneDNN samples here are not meant for performance benchmarking, so the diagram below gives only a rough idea of the speedup from the second selected ISA, such as AVX512, VNNI, or BF16.

In [None]:
onednn.stats_comp('type', 'time', log2, log1)

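The speedup shown per primitive type is, in essence, a ratio of accumulated times. The sketch below illustrates that arithmetic only. `speedup_by_type` is a hypothetical helper, not the logic behind `stats_comp`, and the numbers are made up.

```python
def speedup_by_type(baseline_ms, candidate_ms):
    """Return baseline/candidate time ratios for shared primitive types."""
    return {
        ptype: baseline_ms[ptype] / candidate_ms[ptype]
        for ptype in baseline_ms
        if ptype in candidate_ms and candidate_ms[ptype] > 0
    }

# Made-up totals in milliseconds, for illustration only.
first_isa = {"convolution": 12.0, "inner_product": 4.0}
second_isa = {"convolution": 6.0, "inner_product": 4.0}
print(speedup_by_type(first_isa, second_isa))
```

A ratio above 1 means the second selected ISA spent less time in that primitive type than the first.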
****
### Step 5: Inspect JIT Kernel 
In this section, we analyze the JIT dump files produced by the sample runs in Step 2 and Step 3.   
Users should be able to see exact CPU instruction usage, such as VNNI or BF16 instructions, in those JIT dump files.

#### Inspect either the first or the second selected ISA by setting VALIndex

* To inspect the JIT dump files of the first selected ISA, set VALIndex to 1.  
* To inspect the JIT dump files of the second selected ISA, set VALIndex to 2.  

In [None]:
VALIndex=2

#### List out all JIT Dump Files with index number for the first or second selected ISA

In [None]:
import os

VAL="DNNL_JIT_FD_VAL"+str(VALIndex)
JIT_DUMP_FD=os.environ[VAL]
print("Inspect Folder: ", JIT_DUMP_FD)

# keep only the JIT dump binaries (names containing ".bin"), sorted by name
filenames = os.listdir(JIT_DUMP_FD)
result = sorted(name for name in filenames if ".bin" in name)

# print each dump file with its index
for index, filename in enumerate(result):
    print(" %d : %s " % (index, filename))

#### Pick a JIT Dump file by putting its index value below

In [None]:
FdIndex=0

#### Export the JIT Dump file to the environment variable JITFILE

In [None]:
if FdIndex < len(result):
    logfile = result[FdIndex]
    os.environ["JITFILE"] = JIT_DUMP_FD+os.sep+logfile

#### Disassemble the JIT Dump file

> NOTE: zmm registers are introduced by the AVX512 ISA.  
Users should see usage of **zmm** registers in AVX512 JIT dump files.  

> NOTE: vpdpbusd is introduced by the AVX512_VNNI ISA.  
Users should see usage of **vpdpbusd** in AVX512_VNNI JIT dump files. 

> NOTE: **vdpbf16ps**, **vcvtne2ps2bf16**, and **vcvtneps2bf16** are introduced by the AVX512_BF16 ISA.  
Users should see usage of vdpbf16ps, vcvtne2ps2bf16, or vcvtneps2bf16 in AVX512_BF16 JIT dump files. 

> NOTE: To disassemble the vdpbf16ps, vcvtne2ps2bf16, and vcvtneps2bf16 instructions, users must use objdump **v2.34** or above.

In [None]:
!objdump -D -b binary -mi386:x86-64 $JITFILE | grep -E "$DNNL_ISA_KEYWORD"

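When many dump files need to be compared, counting the matches can be more convenient than reading the grep output. The sketch below counts disassembly lines that match a keyword pattern such as the value of DNNL_ISA_KEYWORD. `count_matches` is a hypothetical helper, and the sample text is a hand-written stand-in for objdump output with addresses and opcode bytes omitted.

```python
import re

def count_matches(disassembly, pattern):
    """Count disassembly lines matching pattern, grouped by matched text."""
    regex = re.compile(pattern)
    counts = {}
    for line in disassembly.splitlines():
        hit = regex.search(line)
        if hit:                            # each line is counted once
            counts[hit.group(0)] = counts.get(hit.group(0), 0) + 1
    return counts

# Hand-written stand-in for objdump output (addresses and bytes omitted).
sample = """\
vpxord %zmm0,%zmm0,%zmm0
vpdpbusd %zmm2,%zmm1,%zmm0
vpdpbusd %zmm3,%zmm1,%zmm0
vmovups %zmm0,(%rdi)
"""
print(count_matches(sample, "vpdpbusd"))
print(count_matches(sample, "zmm"))
```

An alternation such as `vdpbf16ps|vcvtne2ps2bf16` works here as well, since it is valid Python regular expression syntax.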
***
# Summary
In this lab the developer learned the following:
* Use CPU Dispatcher Control to generate JIT code for different Instruction Set Architectures on CPU
* Understand how JIT kernel and CPU instruction usage differs between ISAs
