## BERT-Large benchmark for TensorFlow

We use the [Model Zoo](https://github.com/IntelAI/models) to run the [BERT Large](https://github.com/IntelAI/models/tree/v1.8.1/benchmarks/language_modeling/tensorflow/bert_large/README.md) model on the SQuAD v1.1 dataset.

## Part 1. Download datasets, checkpoints and pre-trained model

Download the datasets, checkpoints and pre-trained model from the Internet to the Databricks File System (DBFS). The data can be shared across clusters and only needs to be downloaded once. If you run multiple copies of this notebook simultaneously, make sure this cell runs only once to avoid unpredictable issues.
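
One way to make the "download only once" rule harder to violate is to check for the expected files before running the download cell. The following is a minimal sketch, not part of the original notebook: the helper name `missing_files` and the constant `EXPECTED_SQUAD_FILES` are hypothetical, and the file names and directory are taken from the download cell in this part.

```python
from pathlib import Path

# Files the download cell is expected to leave in the SQuAD directory.
EXPECTED_SQUAD_FILES = ["dev-v1.1.json", "evaluate-v1.1.py", "train-v1.1.json"]

def missing_files(base_dir, names=EXPECTED_SQUAD_FILES):
    """Return the entries of `names` that do not exist as files under `base_dir`."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]

missing = missing_files("/bench/TF/bert-large/SQuAD-1.1")
if missing:
    print("Run the download cell; missing files:", missing)
else:
    print("Datasets already present; the download cell can be skipped.")
```

The same check could be extended to the checkpoint and pre-trained model directories once their contents are known.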

In [None]:
%sh

# Download datasets, checkpoints and pre-trained model
rm -rf /bench/TF/bert-large
mkdir -p /bench/TF/bert-large

mkdir -p /bench/TF/bert-large/SQuAD-1.1
cd /bench/TF/bert-large/SQuAD-1.1
wget https://github.com/oap-project/oap-project.github.io/raw/master/resources/ai/bert/dev-v1.1.json
wget https://github.com/oap-project/oap-project.github.io/raw/master/resources/ai/bert/evaluate-v1.1.py
wget https://github.com/oap-project/oap-project.github.io/raw/master/resources/ai/bert/train-v1.1.json

cd /bench/TF/bert-large
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_8/bert_large_checkpoints.zip
unzip bert_large_checkpoints.zip

cd /bench/TF/bert-large
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

### After the data has been downloaded, you can start the BERT-Large training/inference workload.

## Part 2. Run BERT-Large training workload

Data preprocessing takes a few minutes at the start. Once preprocessing is done, the training workload prints its throughput in real time. The entire training process takes around five hours on a Standard_F32s_v2 instance. The exact elapsed time depends on the instance type and on whether Intel-optimized TensorFlow is used.


**Note:** ***If you click "Stop Execution" on a running training/inference cell and then immediately run training/inference again, you may see lower performance numbers because the previous run is still in progress.***
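
To see whether an earlier run is still in progress before starting a new one, the process list can be inspected. The sketch below is only an illustration: the helper names `leftover_benchmarks` and `running_benchmarks` are hypothetical, and the two script names are the ones the cells in this notebook already search for with `ps`.

```python
import subprocess

# Script names that the notebook cells look for when cleaning up old runs.
BENCHMARK_SCRIPTS = ("launch_benchmark.py", "run_squad.py")

def leftover_benchmarks(ps_output, scripts=BENCHMARK_SCRIPTS):
    """Return the lines of `ps` output that mention any benchmark script."""
    return [line for line in ps_output.splitlines()
            if any(script in line for script in scripts)]

def running_benchmarks():
    """List benchmark processes currently reported by `ps -Af`."""
    result = subprocess.run(["ps", "-Af"], capture_output=True, text=True, check=True)
    return leftover_benchmarks(result.stdout)
```

Calling `running_benchmarks()` and getting an empty list suggests that no previous benchmark process is competing for the CPU.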

In [None]:
import os
import subprocess

from pathlib import Path
  
def run_training():
  training = '/tmp/training.sh'
  with open(training, 'w') as f:
    f.write(r"""#!/bin/bash
    # BERT-Large Training
    # Install necessary package
    sudo apt-get update
    sudo apt-get install zip -y
    sudo apt-get -y install git
    sudo apt-get install -y libblacs-mpi-dev
    sudo apt-get install -y numactl
    
    # Remove old materials if exist
    rm -rf /TF/
    mkdir /TF/
    # Create ckpt directory
    mkdir -p /TF/BERT-Large-output/
    # Download IntelAI benchmark
    cd /TF/
    wget https://github.com/IntelAI/models/archive/refs/tags/v1.8.1.zip
    unzip v1.8.1.zip
    
    cores_per_socket=$(lscpu | awk '/^Core\(s\) per socket/{ print $4 }')
    numa_nodes=$(lscpu | awk '/^NUMA node\(s\)/{ print $3 }')
    export SQUAD_DIR=/bench/TF/bert-large/SQuAD-1.1
    export BERT_LARGE_MODEL=/bench/TF/bert-large/wwm_uncased_L-24_H-1024_A-16
    export BERT_LARGE_OUTPUT=/TF/BERT-Large-output/
    export PYTHONPATH=$PYTHONPATH:.
    
    function run_training_without_numabind() {
     python launch_benchmark.py \
        --model-name=bert_large \
        --precision=fp32 \
        --mode=training \
        --framework=tensorflow \
        --batch-size=4 \
        --benchmark-only \
        --data-location=$BERT_LARGE_MODEL \
        -- train-option=SQuAD \
           DEBIAN_FRONTEND=noninteractive \
           config_file=$BERT_LARGE_MODEL/bert_config.json \
           init_checkpoint=$BERT_LARGE_MODEL/bert_model.ckpt \
           vocab_file=$BERT_LARGE_MODEL/vocab.txt \
           train_file=$SQUAD_DIR/train-v1.1.json \
           predict_file=$SQUAD_DIR/dev-v1.1.json \
           do-train=True \
           learning-rate=1.5e-5 \
           max-seq-length=384 \
           do_predict=True \
           warmup-steps=0 \
           num_train_epochs=0.1 \
           doc_stride=128 \
           do_lower_case=False \
           experimental-gelu=False \
           mpi_workers_sync_gradients=True
    }

    function run_training_with_numabind() {
      intra_thread=`expr $cores_per_socket - 2`
      python launch_benchmark.py \
        --model-name=bert_large \
        --precision=fp32 \
        --mode=training \
        --framework=tensorflow \
        --batch-size=4 \
        --mpi_num_processes=$numa_nodes \
        --num-intra-threads=$intra_thread \
        --num-inter-threads=1 \
        --benchmark-only \
        --data-location=$BERT_LARGE_MODEL \
        -- train-option=SQuAD \
           DEBIAN_FRONTEND=noninteractive \
           config_file=$BERT_LARGE_MODEL/bert_config.json \
           init_checkpoint=$BERT_LARGE_MODEL/bert_model.ckpt \
           vocab_file=$BERT_LARGE_MODEL/vocab.txt \
           train_file=$SQUAD_DIR/train-v1.1.json \
           predict_file=$SQUAD_DIR/dev-v1.1.json \
           do-train=True \
           learning-rate=1.5e-5 \
           max-seq-length=384 \
           do_predict=True \
           warmup-steps=0 \
           num_train_epochs=0.1 \
           doc_stride=128 \
           do_lower_case=False \
           experimental-gelu=False \
           mpi_workers_sync_gradients=True
    }
    
    # Launch Benchmark for training
    cd /TF/models-1.8.1/benchmarks/
    
    if [ "$numa_nodes" = "1" ];then
            run_training_without_numabind
    else
            run_training_with_numabind
    fi """)
    
  os.chmod(training, 0o555)  # octal mode r-xr-xr-x; decimal 555 sets unrelated bits
  # Kill leftovers from earlier runs; -r stops xargs from calling kill with no PIDs.
  os.system("ps -Af | grep launch_benchmark.py | grep -v grep | awk '{print $2}' | xargs -r kill -9")
  os.system("ps -Af | grep run_squad.py | grep -v grep | awk '{print $2}' | xargs -r kill -9")
  # Merge stderr into stdout so an unread stderr pipe cannot fill up and block the script.
  p = subprocess.Popen([training], stdin=None, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  # With a second NUMA node the script starts one process per node, and the
  # logged rate is doubled to estimate the total (this assumes exactly two nodes).
  scale = 2 if Path("/sys/devices/system/node/node1").exists() else 1

  # p.stdout yields bytes, so the end-of-stream sentinel must be b'' rather than ''.
  for line in iter(p.stdout.readline, b''):
    if b"Reading package lists..." in line or b"answer: [UNK] 1848" in line:
      print("\t\t\t\t  Preparing data ......", end='\r')
    if b"INFO:tensorflow:examples/sec" in line:
      throughput = float(line.decode(errors='replace').split(' ')[1]) * scale
      print("\t\t\t\t  Training started, current real-time throughput (examples/sec) : " + str(throughput), end='\r')

  p.stdout.close()
  p.wait()

run_training()
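
The training cell multiplies the logged throughput by 2 whenever `/sys/devices/system/node/node1` exists, while the shell script starts `$numa_nodes` MPI processes. On a machine with more than two NUMA nodes those two numbers would disagree. A possible way to derive the multiplier from the actual node count is sketched below; the helper `count_numa_nodes` is hypothetical, and it assumes the usual Linux layout in which each node appears as a `nodeN` directory.

```python
import re
from pathlib import Path

def count_numa_nodes(root="/sys/devices/system/node"):
    """Count `nodeN` directories under `root`; fall back to 1 if none are found."""
    root = Path(root)
    if not root.is_dir():
        return 1
    nodes = [p for p in root.iterdir()
             if p.is_dir() and re.fullmatch(r"node\d+", p.name)]
    return max(len(nodes), 1)

print("NUMA nodes detected:", count_numa_nodes())
```

Using this value as the multiplier only makes sense if every process reports its own rate, which is what the doubling in the cell above suggests.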

## Part 3. Run BERT-Large inference workload

Data preprocessing takes a few minutes at the start. Once preprocessing is done, the inference workload prints its throughput in real time. The entire inference process takes around 30 minutes on a Standard_F32s_v2 instance. The exact elapsed time depends on the instance type and on whether Intel-optimized TensorFlow is used.

**Note:** ***If you click "Stop Execution" on a running training/inference cell and then immediately run training/inference again, you may see lower performance numbers because the previous run is still in progress.***

In [None]:
import os
import subprocess

from pathlib import Path
  
def run_inference():
  inference = '/tmp/inference.sh'
  with open(inference, 'w') as f:
    f.write(r"""#!/bin/bash
    # BERT-Large Inference
    # Install necessary package
    sudo apt-get update
    sudo apt-get install zip -y
    sudo apt-get -y install git
    sudo apt-get install -y numactl
    # Remove old materials if exist
    rm -rf /TF/
    mkdir /TF/
    # Create ckpt directory
    mkdir -p /TF/BERT-Large-output/
    export BERT_LARGE_OUTPUT=/TF/BERT-Large-output
    # Download IntelAI benchmark
    cd /TF/
    wget https://github.com/IntelAI/models/archive/refs/tags/v1.8.1.zip
    unzip v1.8.1.zip
    cd /TF/models-1.8.1/
    wget https://github.com/oap-project/oap-tools/raw/master/integrations/ml/databricks/benchmark/IntelAI_models_bertlarge_inference_realtime_throughput.patch
    git apply IntelAI_models_bertlarge_inference_realtime_throughput.patch

    export SQUAD_DIR=/bench/TF/bert-large/SQuAD-1.1/
    export BERT_LARGE_DIR=/bench/TF/bert-large/
    export PYTHONPATH=$PYTHONPATH:.

    # Launch Benchmark for inference
    numa_nodes=$(lscpu | awk '/^NUMA node\(s\)/{ print $3 }')

    function run_inference_without_numabind() {
      cd /TF/models-1.8.1/benchmarks/
      python3 launch_benchmark.py \
        --model-name=bert_large \
        --precision=fp32 \
        --mode=inference \
        --framework=tensorflow \
        --batch-size=32 \
        --data-location $BERT_LARGE_DIR/wwm_uncased_L-24_H-1024_A-16 \
        --checkpoint $BERT_LARGE_DIR/bert_large_checkpoints \
        --output-dir $BERT_LARGE_OUTPUT/bert-squad-output \
        --verbose \
        -- infer_option=SQuAD \
           DEBIAN_FRONTEND=noninteractive \
           predict_file=$SQUAD_DIR/dev-v1.1.json \
           experimental-gelu=False \
           init_checkpoint=model.ckpt-3649
    }

    function run_inference_with_numabind() {
      cd /TF/models-1.8.1/benchmarks/
      nohup python3 launch_benchmark.py \
        --model-name=bert_large \
        --precision=fp32 \
        --mode=inference \
        --framework=tensorflow \
        --batch-size=32 \
        --socket-id 0  \
        --data-location $BERT_LARGE_DIR/wwm_uncased_L-24_H-1024_A-16 \
        --checkpoint $BERT_LARGE_DIR/bert_large_checkpoints \
        --output-dir $BERT_LARGE_OUTPUT/bert-squad-output \
        --verbose \
        -- infer_option=SQuAD \
           DEBIAN_FRONTEND=noninteractive \
           predict_file=$SQUAD_DIR/dev-v1.1.json \
           experimental-gelu=False \
           init_checkpoint=model.ckpt-3649 >> socket0-inference-log &

       python3 launch_benchmark.py \
        --model-name=bert_large \
        --precision=fp32 \
        --mode=inference \
        --framework=tensorflow \
        --batch-size=32 \
        --socket-id 1 \
        --data-location $BERT_LARGE_DIR/wwm_uncased_L-24_H-1024_A-16 \
        --checkpoint $BERT_LARGE_DIR/bert_large_checkpoints \
        --output-dir $BERT_LARGE_OUTPUT/bert-squad-output \
        --verbose \
        -- infer_option=SQuAD \
           DEBIAN_FRONTEND=noninteractive \
           predict_file=$SQUAD_DIR/dev-v1.1.json \
           experimental-gelu=False \
           init_checkpoint=model.ckpt-3649
    }

    if [ "$numa_nodes" = "1" ];then
            run_inference_without_numabind
    else
            run_inference_with_numabind
    fi""")
    
  os.chmod(inference, 0o555)  # octal mode r-xr-xr-x; decimal 555 sets unrelated bits
  # Kill leftovers from earlier runs; -r stops xargs from calling kill with no PIDs.
  os.system("ps -Af | grep launch_benchmark.py | grep -v grep | awk '{print $2}' | xargs -r kill -9")
  os.system("ps -Af | grep run_squad.py | grep -v grep | awk '{print $2}' | xargs -r kill -9")
  # Merge stderr into stdout so an unread stderr pipe cannot fill up and block the script.
  p = subprocess.Popen([inference], stdin=None, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  # In the two-socket case only the socket 1 run writes to this pipe (socket 0
  # logs to a file), so the reported rate is doubled to estimate the total.
  scale = 2 if Path("/sys/devices/system/node/node1").exists() else 1

  # p.stdout yields bytes, so the end-of-stream sentinel must be b'' rather than ''.
  for line in iter(p.stdout.readline, b''):
    if b'Reading package lists...' in line or b'INFO:tensorflow:tokens' in line or b'INFO:tensorflow:  name = bert' in line:
      print("\t\t\t\t  Preparing data ......", end='\r')
    if b"INFO:tensorflow:examples/sec" in line:
      throughput = float(line.decode(errors='replace').split(' ')[1]) * scale
      print("\t\t\t\t  Inference started, current real-time throughput (examples/sec) : " + str(throughput), end='\r')
    if b"throughput((num_processed_examples-threshod_examples)/Elapsedtime)" in line:
      overall = float(line.decode(errors='replace').split(':')[1]) * scale
      print("\t\t\t\t  Inference finished, overall inference throughput (examples/sec) : " + str(overall), end='\r')

  p.stdout.close()
  p.wait()
  
run_inference()
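
Both the training and inference cells pull the real-time throughput out of the log by splitting each line on spaces. A regular expression makes the expected format explicit and skips lines that do not match. The sketch below is an illustration only: `parse_throughput` is a hypothetical helper, and the assumed log format (`INFO:tensorflow:examples/sec: <number>`) is inferred from the parsing code in this notebook, not from TensorFlow documentation.

```python
import re

# Assumed log line format, inferred from the notebook's parsing code:
#   INFO:tensorflow:examples/sec: 12.3456
THROUGHPUT_RE = re.compile(rb"INFO:tensorflow:examples/sec:?\s+([0-9]+(?:\.[0-9]+)?)")

def parse_throughput(line, processes=1):
    """Return examples/sec from one log line (bytes), scaled by `processes`, or None."""
    match = THROUGHPUT_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)) * processes

print(parse_throughput(b"INFO:tensorflow:examples/sec: 12.5\n", processes=2))
```

Returning `None` for non-matching lines lets the caller ignore unrelated output without raising an exception.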

## Check whether TensorFlow is Intel-optimized

This is a simple helper script that checks whether the installed TensorFlow is Intel-optimized. "True" means Intel-optimized TensorFlow.

In [None]:
# Print the version and check whether TensorFlow is Intel-optimized
import tensorflow
print("tensorflow version: " + tensorflow.__version__)

from packaging import version
if (version.parse("2.5.0") <= version.parse(tensorflow.__version__)):
  from tensorflow.python.util import _pywrap_util_port
  print( _pywrap_util_port.IsMklEnabled())
else:
  from tensorflow.python import _pywrap_util_port
  print(_pywrap_util_port.IsMklEnabled())
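
The version check above can also be written as a small function that returns a value instead of printing it, which makes it easier to reuse in other cells. This is a sketch under stated assumptions: `mkl_module_name` and `is_intel_optimized` are hypothetical helper names, the two module paths are the ones used in the cell above, and the version is compared on its major and minor numbers only, without the `packaging` dependency.

```python
import importlib
import re

def mkl_module_name(tf_version):
    """Pick the module that holds IsMklEnabled for the given TensorFlow version string."""
    major_minor = tuple(int(re.match(r"\d+", part).group())
                        for part in tf_version.split(".")[:2])
    if major_minor >= (2, 5):
        return "tensorflow.python.util._pywrap_util_port"
    return "tensorflow.python._pywrap_util_port"

def is_intel_optimized():
    """Return True/False for the installed TensorFlow, or None if it is not installed."""
    try:
        import tensorflow
    except ImportError:
        return None
    module = importlib.import_module(mkl_module_name(tensorflow.__version__))
    return bool(module.IsMklEnabled())
```

Calling `is_intel_optimized()` in a notebook cell gives the same answer as the script above, with `None` signalling that TensorFlow could not be imported.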