In [None]:
# =============================================================
# Copyright © 2023 Intel Corporation
# 
# SPDX-License-Identifier: MIT
# =============================================================

# TensorFlow Transformer with AMX bfloat16 Mixed Precision Learning

In this example we will train a Transformer block for text classification using the **IMDB dataset**. Then we will modify the code to use mixed precision learning with **bfloat16**. The example is based on the [Text classification with Transformer Keras code example](https://keras.io/examples/nlp/text_classification_with_transformer/).

To start this sample, make sure you have installed the [Intel® oneAPI AI Analytics Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html). For more information and instructions, please follow [Get Started with the Intel® AI Analytics Toolkit](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html).

We also need to check that the Jupyter Notebook is running on a 4th Gen Intel® Xeon® Scalable Processor (Sapphire Rapids). The code below prints the architecture the Notebook is running on. If it prints `SPR`, you are ready to go with the rest of the sample.

In [None]:
from cpuinfo import get_cpu_info

info = get_cpu_info()
flags = info['flags']
# Architectures ordered from newest to oldest, each paired with the ISA flags that identify it.
arch_list = ['SPR', 'CPX', 'ICX|CLX', 'SKX', 'BDW|CORE|ATOM']
isa_list = [['amx_bf16', 'amx_int8', 'amx_tile'], ['avx512_bf16'], ['avx512_vnni'], ['avx512'], ['avx2']]
# Start from the oldest architecture and keep the newest one whose ISA flag is present.
index = len(arch_list) - 1
for flag in flags:
    for idx, isa_sublist in enumerate(isa_list):
        for isa in isa_sublist:
            if isa in flag:
                if idx < index:
                    index = idx
arch = arch_list[index]

print(arch)

Let's start by downloading the sample from the keras-io GitHub repository.

In [None]:
!wget https://raw.githubusercontent.com/keras-team/keras-io/master/examples/nlp/text_classification_with_transformer.py

## Existing example explanation
The example implements a Transformer block as a layer. The Transformer block consists of self-attention, feed-forward (i.e., Dense), and normalization layers. The example wraps these in its `TransformerBlock` Keras layer.
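
To make the structure of such a block concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: it assumes a single attention head, no learned weights (identity projections, and a ReLU standing in for the Dense layers), and no dropout. All function names are hypothetical and exist only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(x[0])
    output = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        weights = softmax(scores)
        # Each output vector is a weighted average of all input vectors.
        output.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(dim)])
    return output


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def transformer_block(x):
    """Attention -> add & norm -> feed-forward -> add & norm."""
    attended = self_attention(x)
    hidden = [layer_norm([a + b for a, b in zip(xi, ai)]) for xi, ai in zip(x, attended)]
    # Assumption: a bare ReLU replaces the learned Dense layers of a real block.
    fed_forward = [[max(0.0, v) for v in row] for row in hidden]
    return [layer_norm([a + b for a, b in zip(hi, fi)]) for hi, fi in zip(hidden, fed_forward)]


sequence = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]  # 2 time steps, 3 features
print(transformer_block(sequence))
```

The output has the same shape as the input, one vector per time step, which is what lets the block be stacked or followed by a pooling layer.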

Next, it implements the embedding layer. There are two separate embedding layers:
* one for tokens,
* one for token indices (positions).

In Transformer-based networks, we need to include positional information about the tokens in the embeddings. The example does this with its `TokenAndPositionEmbedding` Keras layer.
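
As a rough illustration of the idea, the sketch below adds a token vector and a position vector for every element of a sequence. The lookup tables are tiny hypothetical values; in the real layer they are learned embedding matrices.

```python
def token_and_position_embedding(token_ids, token_table, position_table):
    """Return one vector per token: token embedding plus position embedding."""
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[position])]
        for position, token_id in enumerate(token_ids)
    ]


# Hypothetical 2-dimensional lookup tables, for illustration only.
token_table = {7: [0.5, -0.5], 9: [1.0, 2.0]}
position_table = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

embedded = token_and_position_embedding([7, 9, 7], token_table, position_table)
print(embedded)
```

Token `7` appears at positions 0 and 2 and receives a different vector each time, which is how the model can tell the two occurrences apart.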

In the next step, the **IMDB dataset** is downloaded. It contains 50,000 movie reviews in 2 classes (positive and negative), split into 25,000 texts for training and 25,000 for testing. The example considers only the top 20,000 words as the vocabulary and only the first 200 words of each movie review.

The following step is to create the classifier. The Transformer layer outputs one vector for each time step of the input sequence. The example takes the mean across all time steps and uses a feed-forward network on top of it to classify the text.
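
Taking the mean across time steps can be sketched in a few lines of plain Python. In Keras this role is typically played by a layer such as `GlobalAveragePooling1D`; the function name below is a hypothetical stand-in.

```python
def global_average_pool(sequence_output):
    """Average a list of per-time-step vectors into a single vector."""
    steps = len(sequence_output)
    return [sum(column) / steps for column in zip(*sequence_output)]


# Three time steps with two features each (illustrative values).
sequence_output = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
pooled = global_average_pool(sequence_output)
print(pooled)
```

The pooled vector has a fixed size regardless of the sequence length, so a Dense classifier can be placed directly on top of it.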

The last step is to train and evaluate the model.


## Performance measure

To show the performance benefit of bfloat16 mixed precision learning, let's measure the time needed to train the model. The prepared patch file `time.patch` adds code that stores the times before and after training and prints the difference to standard output. Let's look at the prepared file:

In [None]:
!cat ./patch/time.patch

Now let's apply the time measurement to the keras.io sample.

In [None]:
!patch text_classification_with_transformer.py ./patch/time.patch
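
For reference, this kind of timing can be sketched as follows. This is not the content of `time.patch`: the helper names are hypothetical, the training call is replaced by a stand-in, and the exact output format is an assumption (only the `time: ` prefix is taken from the `grep` used later in this notebook).

```python
import time


def timed(train_fn):
    """Run train_fn() and return (result, elapsed wall-clock seconds)."""
    start = time.time()
    result = train_fn()
    elapsed = time.time() - start
    return result, elapsed


def fake_training():
    # Hypothetical stand-in for the model.fit(...) call of the sample.
    return sum(i * i for i in range(100_000))


_, elapsed = timed(fake_training)
print("time: ", elapsed)
```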

#### Run sample
The script `job.sh` encapsulates the program for submission to the job queue for execution.

To collect information about how much of the application runtime is spent executing oneDNN primitives, and which of them take the most time, we use the oneDNN verbose mode:

* `ONEDNN_VERBOSE=1` - to print primitive information at execution,
* `ONEDNN_VERBOSE_TIMESTAMP=1` - to display timestamps.

The whole output of the program will be saved in a dedicated log file.
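
Once such a log exists, the time per primitive kind can be estimated with a short script. The sketch below assumes comma-separated verbose lines in which the fields after the `exec` marker are the engine, the primitive kind, and the implementation, and in which the last field is the execution time in milliseconds. The exact layout differs between oneDNN versions, and the sample lines are made up for illustration, not taken from a real run.

```python
from collections import defaultdict


def time_per_primitive(lines):
    """Sum execution time (ms) per primitive kind from oneDNN verbose lines."""
    totals = defaultdict(float)
    for line in lines:
        if not line.startswith(("onednn_verbose", "dnnl_verbose")):
            continue  # regular program output
        fields = line.strip().split(",")
        if "exec" not in fields:
            continue  # info or creation lines
        marker = fields.index("exec")
        try:
            kind = fields[marker + 2]  # exec, engine, primitive kind
            elapsed_ms = float(fields[-1])
        except (IndexError, ValueError):
            continue  # line does not match the assumed layout
        totals[kind] += elapsed_ms
    return dict(totals)


# Made-up lines in the assumed layout (with a timestamp field after the prefix).
sample_log = [
    "onednn_verbose,info,cpu,runtime:OpenMP",
    "onednn_verbose,1700000000000.1,exec,cpu,matmul,brg:avx512_core_amx_bf16,undef,src_bf16 wei_bf16 dst_bf16,,,32x64:64x64,0.25",
    "onednn_verbose,1700000000000.9,exec,cpu,matmul,brg:avx512_core_amx_bf16,undef,src_bf16 wei_bf16 dst_bf16,,,32x64:64x64,0.75",
    "onednn_verbose,1700000000001.5,exec,cpu,softmax,jit:avx512_core,forward_training,src_f32 dst_f32,,axis:1,32x64,0.5",
    "Epoch 1/2",
]
print(time_per_primitive(sample_log))
```

For a real log, the lines would come from `open('./logs/dnn_logs.txt')`, and the field positions should be checked against the first few `exec` lines before trusting the totals.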

In [None]:
%%writefile job.sh
#!/bin/bash

mkdir -p logs

wget https://raw.githubusercontent.com/IntelAI/models/master/benchmarks/common/platform_util.py

echo "########## Executing the run"

source /opt/intel/oneapi/setvars.sh
source activate tensorflow

ONEDNN_VERBOSE_TIMESTAMP=1 ONEDNN_VERBOSE=1 python ./text_classification_with_transformer.py > ./logs/dnn_logs.txt

echo "########## Done with the run"

#### Submitting job.sh to the job queue

Now we can submit `job.sh` to the job queue.

**NOTE - it is possible to run any of the commands in local environments.**

To enable users to run their scripts either on the Intel DevCloud or in local environments, this and subsequent run cells check for the existence of the job submission command `qsub`. If the check fails, it is assumed that the run will be local.
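
The same decision can be sketched in Python with `shutil.which`, which looks up an executable on `PATH` much like `command -v`. The function name is hypothetical, and the `../../q` wrapper path is simply copied from the shell cells of this notebook.

```python
import shutil


def run_command(script, which=shutil.which):
    """Return the command for a script: queue wrapper if qsub exists, else local run."""
    if which("qsub"):
        return ["../../q", script]
    return ["./" + script]


print(run_command("job.sh"))
```

Passing the lookup function as a parameter keeps the sketch easy to try on a machine without `qsub`.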

In [None]:
!export property=spr; chmod 755 q; chmod 755 job.sh; if [ -x "$(command -v qsub)" ]; then ../../q job.sh; else ./job.sh; fi

## Modification for mixed precision learning using bfloat16

To use bfloat16 mixed precision learning we need to add the following lines:

```python
from tensorflow.keras import mixed_precision

policy = mixed_precision.Policy('mixed_bfloat16')
mixed_precision.set_global_policy(policy)
```

These lines are available in the prepared patch file `mixed_precision.patch`; the rest of the code stays the same. Let's take a look at what's in the prepared file.

In [None]:
!cat ./patch/mixed_precision.patch

Now let's apply it to the example downloaded from keras.io.

In [None]:
!patch text_classification_with_transformer.py ./patch/mixed_precision.patch
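
To get a feel for what bfloat16 means numerically: it keeps the sign bit and the 8 exponent bits of float32 but only 7 mantissa bits, so it covers the same range with less precision. The sketch below rounds a value to the nearest bfloat16 using only the standard library. The helper name is hypothetical, the code ignores NaN handling, and it is not how TensorFlow or oneDNN perform the conversion internally.

```python
import struct


def to_bfloat16(value):
    """Round a float to the nearest bfloat16 value and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest even on the 16 low bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    # NaN inputs are not handled in this sketch.
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


for value in (1.0, 3.14159, 1.0 + 2 ** -10, 1e38):
    print(value, "->", to_bfloat16(value))
```

Small differences such as `2 ** -10` next to `1.0` are lost, while very large magnitudes remain representable. Under the `mixed_bfloat16` policy, layers compute in bfloat16 while their variables stay in float32.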

#### Run sample and submit script to the job queue
Let's use a script similar to the `job.sh` we prepared earlier and submit the updated version of the text classification sample with bfloat16 mixed precision learning. We only change the file used for saving logs.

In [None]:
%%writefile job_mixed.sh
#!/bin/bash

echo "########## Executing the run"

source /opt/intel/oneapi/setvars.sh
source activate tensorflow

ONEDNN_VERBOSE_TIMESTAMP=1 ONEDNN_VERBOSE=1 python ./text_classification_with_transformer.py > ./logs/dnn_logs_mixed.txt

echo "########## Done with the run"

In [None]:
!export property=spr; chmod 755 q; chmod 755 job_mixed.sh; if [ -x "$(command -v qsub)" ]; then ../../q job_mixed.sh; else ./job_mixed.sh; fi

### Performance comparison

Now let's parse the outputs of `job.sh` and `job_mixed.sh` to compare the training times of the models.

In [None]:
!cat ./logs/dnn_logs.txt | grep "time: "; cat ./logs/dnn_logs_mixed.txt | grep "time: "

The times for training the text classification sample are shown above. The first is for **float32 learning**, and the second is for training with **bfloat16 mixed precision**.
As we can see, training with bfloat16 mixed precision takes less time than training with float32, which shows the performance improvement from using AMX and bfloat16.
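
To turn the two printed times into a single speedup figure, a small helper like the one below can be used. It assumes each log contains a line of the form `time: <seconds>`. That format and the function names are assumptions, and the numbers in the example are placeholders rather than measurements.

```python
import re

TIME_PATTERN = re.compile(r"time:\s*([0-9]+(?:\.[0-9]+)?)")


def extract_time(log_text):
    """Return the first 'time: <seconds>' value found in the text, or None."""
    match = TIME_PATTERN.search(log_text)
    return float(match.group(1)) if match else None


def speedup(fp32_log_text, bf16_log_text):
    """Ratio of float32 training time to bfloat16 training time."""
    fp32_time = extract_time(fp32_log_text)
    bf16_time = extract_time(bf16_log_text)
    if fp32_time is None or not bf16_time:
        raise ValueError("could not find a usable 'time: ' line in both logs")
    return fp32_time / bf16_time


# Placeholder log contents, not real measurements.
print(speedup("Epoch 2/2\ntime:  100.0\n", "Epoch 2/2\ntime:  50.0\n"))
```

With the real logs, the two arguments would be the contents of `./logs/dnn_logs.txt` and `./logs/dnn_logs_mixed.txt`. A value above 1 means the bfloat16 run was faster.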

## ISA Comparison
The section below compares different ISAs in terms of JIT kernel usage and CPU instruction usage.

Those comparisons can be conducted on the same CPU microarchitecture with the help of oneDNN CPU dispatcher control.
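
As a sketch of how the dispatcher control can be driven from Python rather than a shell script, the helper below builds the environment for a child process. The environment variable names are the ones used in the scripts of this notebook. The function name and its parameters are hypothetical.

```python
import os


def onednn_env(max_isa, verbose_level=2, jit_dump=True, base_env=None):
    """Return a copy of the environment with the oneDNN controls set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DNNL_MAX_CPU_ISA"] = max_isa          # upper limit for ISA dispatch
    env["DNNL_VERBOSE"] = str(verbose_level)   # verbose log level
    env["DNNL_JIT_DUMP"] = "1" if jit_dump else "0"
    return env


env = onednn_env("AVX512_CORE_BF16", base_env={})
print(env)

# Possible usage (not executed here):
# import subprocess
# subprocess.run(["python", "./text_classification_with_transformer.py"], env=env)
```

Setting the variables in the child's environment keeps each run isolated, so one notebook session can launch several runs with different ISA limits.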

### AMX run

First, we will run the same example on the maximum available CPU ISA, i.e., AMX, by setting `DNNL_MAX_CPU_ISA` to `AVX512_CORE_AMX`. The execution statistics of the example will be saved to `./logs/log_cpu_bf16_avx512_amx.csv`.

In [None]:
%%writefile run_amx.sh
#!/bin/bash

echo "########## Executing the run"

source activate tensorflow

# enable verbose log
export DNNL_VERBOSE=2 
# enable JIT Dump
export DNNL_JIT_DUMP=1

DNNL_MAX_CPU_ISA=AVX512_CORE_AMX python ./text_classification_with_transformer.py cpu >> ./logs/log_cpu_bf16_avx512_amx.csv 2>&1

echo "########## Done with the run"

In [None]:
!export property=spr; chmod 755 q; chmod 755 run_amx.sh; if [ -x "$(command -v qsub)" ]; then ../../q run_amx.sh; else ./run_amx.sh; fi

### AVX512 BF16 run

Next, we will run the example with the maximum CPU ISA limited to AVX512 BF16 by setting `DNNL_MAX_CPU_ISA` to `AVX512_CORE_BF16`. The execution statistics of the example will be saved to `./logs/log_cpu_bf16_avx512_bf16.csv`.

In [None]:
%%writefile run.sh
#!/bin/bash

echo "########## Executing the run"

source activate tensorflow

# enable verbose log
export DNNL_VERBOSE=2 
# enable JIT Dump
export DNNL_JIT_DUMP=1

DNNL_MAX_CPU_ISA=AVX512_CORE_BF16 python ./text_classification_with_transformer.py cpu >> ./logs/log_cpu_bf16_avx512_bf16.csv 2>&1

echo "########## Done with the run"

In [None]:
!export property=spr; chmod 755 q; chmod 755 run.sh; if [ -x "$(command -v qsub)" ]; then ../../q run.sh; else ./run.sh; fi

#### oneDNN Verbose Log JIT Kernel Time BreakDown

oneDNN uses just-in-time (JIT) compilation to generate optimal code for some functions based on the input parameters and the instruction set supported by the system.
Therefore, users can see different JIT kernel types for the first selected ISA (AMX) and the second selected ISA (AVX512 BF16).

To parse the oneDNN verbose output we use a prepared profiling tool, the `profile_utils.py` file. Let's download it.

In [None]:
!wget https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/Libraries/oneDNN/tutorials/profiling/profile_utils.py

We can parse verbose log and get the data back now.

In [None]:
from profile_utils import oneDNNUtils, oneDNNLog
onednn = oneDNNUtils()

logfile1 = './logs/log_cpu_bf16_avx512_bf16.csv'
log1 = oneDNNLog()
log1.load_log(logfile1)
exec_data1 = log1.exec_data

logfile2 = './logs/log_cpu_bf16_avx512_amx.csv'
log2 = oneDNNLog()
log2.load_log(logfile2)
exec_data2 = log2.exec_data

##### JIT Kernel Type Time breakdown for AVX512

In [None]:
onednn.breakdown(exec_data1,"jit","time")

##### JIT Kernel Type Time breakdown for AMX 

In [None]:
onednn.breakdown(exec_data2,"jit","time")

In [None]:
print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]")