# Quantize OpenVINO&trade; Model from FP32 to INT8

In this notebook, we'll demonstrate how to convert an OpenVINO&trade; model from FP32 to INT8 precision using the [post-training optimization toolkit API](https://docs.openvinotoolkit.org/latest/pot_compression_api_README.html). 

We'll assume that you already ran `train.py` and have trained a TensorFlow 3D U-Net model on the BraTS Medical Decathlon dataset.  We'll further assume that you have converted the final TensorFlow 3D U-Net model to OpenVINO&trade; by running something like:

```
source /opt/intel/openvino/bin/setupvars.sh
python $INTEL_OPENVINO_DIR/deployment_tools/model_optimizer/mo_tf.py \
       --saved_model_dir 3d_unet_decathlon_final \
       --model_name 3d_unet_decathlon \
       --batch 1  \
       --output_dir openvino_models/FP32 \
       --data_type FP32
```

## Import the OpenVINO&trade; Python API

In [None]:
from openvino.inference_engine import IECore

## Import some other Python libraries

In [None]:
import numpy as np
import time
import os

import matplotlib.pyplot as plt
%matplotlib inline

## Load the dataset

Note: We'll reuse the same data loader we used in training. However, all the API really needs is the 3D MRI scan as a NumPy array, with the same preprocessing (normalization, cropping, etc.) that was applied during training.

In [None]:
from dataloader import DatasetGenerator
import settings

crop_dim = (settings.TILE_HEIGHT, settings.TILE_WIDTH,
            settings.TILE_DEPTH, settings.NUMBER_INPUT_CHANNELS)

settings.BATCH_SIZE = 1 

brats_data = DatasetGenerator(crop_dim=crop_dim,
                              data_path=settings.DATA_PATH,
                              batch_size=settings.BATCH_SIZE,
                              train_test_split=settings.TRAIN_TEST_SPLIT,
                              validate_test_split=settings.VALIDATE_TEST_SPLIT,
                              number_output_classes=settings.NUMBER_OUTPUT_CLASSES,
                              random_seed=settings.RANDOM_SEED)

brats_data.print_info()  # Print dataset information

## OpenVINO&trade; Post-training Optimization Toolkit Imports

These are the additional libraries we need for the OpenVINO&trade; Post-training Optimization Toolkit.

In [None]:
from addict import Dict

from compression.api import Metric, DataLoader
from compression.engines.ie_engine import IEEngine
from compression.graph import load_model, save_model
from compression.graph.model_utils import compress_model_weights
from compression.pipeline.initializer import create_pipeline
from compression.utils.logger import init_logger


## Pretty print

Let's just define some pretty colors to print text. 

In [None]:
class bcolors:
    """
    Just gives us some colors for the text
    """
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'


## Post-training Optimization Toolkit API Settings

Here we define some of the settings for the calibration tool. Most of this just defines where to find the OpenVINO&trade; IR model files (`.xml`,`.bin`). It also defines the dataset configuration (which will be passed to the data loader defined in the cell below).

In [None]:
import settings

openvino_modelname=settings.SAVED_MODEL_NAME
openvino_directory="openvino_models"

int8_directory=os.path.join(openvino_directory, "INT8")
maximum_metric_drop = 0.05  # For accuracy-aware quantization. This defines how much the metric is allowed to drop.
openvino_path = os.path.join(openvino_directory, "FP32", openvino_modelname)
accuracy_aware_quantization=False

path_to_xml_file = "{}.xml".format(openvino_path)
path_to_bin_file = "{}.bin".format(openvino_path)

dataset_config = {
    "num_samples": 40,   # Get 40 samples
    "test_dataset": brats_data.get_test()   # Pass our TensorFlow data loader to the API
}

model_config = Dict({
    "model_name": openvino_modelname,
    "model": path_to_xml_file,
    "weights": path_to_bin_file
})
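
Since these settings are mostly file paths, it can help to confirm that the IR files exist before handing them to the toolkit. The following is a minimal sketch using only the Python standard library. The helper names `ir_paths` and `missing_files` and the model name `demo_model` are hypothetical and are not part of the notebook or of OpenVINO&trade;.

```python
import os
import tempfile


def ir_paths(directory, precision, model_name):
    """Return the (.xml, .bin) paths for an IR stored as directory/precision/model_name.*"""
    base = os.path.join(directory, precision, model_name)
    return base + ".xml", base + ".bin"


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]


# Demonstration in a throwaway directory (the model name is made up).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "FP32"))
    xml_path, bin_path = ir_paths(tmp, "FP32", "demo_model")
    open(xml_path, "w").close()  # Create only the .xml file
    missing = missing_files([xml_path, bin_path])
    missing_names = [os.path.basename(p) for p in missing]
    print("Missing IR files:", missing_names)
```

In the notebook, the same check could be applied to `path_to_xml_file` and `path_to_bin_file` before calling `load_model`.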

## Data Loader for OpenVINO&trade; Post-Training Optimization Toolkit API

You can define this data loader to work with your custom dataset. In our case, we've already defined a TensorFlow `tf.data` object. We'll just pass that to the API's data loader and transpose the images and masks (OpenVINO&trade; assumes the data is channels first: NCHWD).

In [None]:
class MyDataLoader(DataLoader):

    def __init__(self, config):
        """
        You can define this data loader to work with your custom dataset.
        In our case, we've already defined a TensorFlow `tf.data` object.
        We'll just pass that to the API's data loader and transpose the images and masks
        (OpenVINO assumes the data is channels first: NCHWD)
        """

        super().__init__(config)

        self.items = np.arange(config["num_samples"])  # Just pass in how many samples you want to take
        self.dataset = config["test_dataset"]

        print(bcolors.UNDERLINE + "\nQuantizing FP32 OpenVINO model to INT8\n" + bcolors.ENDC)

        print(bcolors.OKBLUE + "Taking {:,} random samples from the test dataset ".format(len(self.items)) + \
            bcolors.ENDC)

        self.batch_size = 1

    def set_subset(self, indices):
        self._subset = None

    @property
    def batch_num(self):
        return int(np.ceil(self.size / self.batch_size))  # `ceil` was never imported; use NumPy instead

    @property
    def size(self):
        return self.items.shape[0]

    def __len__(self):
        return self.size

    def __getitem__(self, item):
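        """
        Returns ((index, mask), image) for one sample.
        Both the image and the mask are transposed to channels first (NCHWD).
        """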
        ds = self.dataset.take(1).as_numpy_iterator()  # Grab the next batch and take a single element (image/mask)
        for img, msk in ds:
            img = np.transpose(img, [0,4,1,2,3])  # OpenVINO expects the input to be channels first (NCHWD)
            msk = np.transpose(msk, [0,4,1,2,3])  # OpenVINO expects the label/output to be channels first (NCHWD)
        
        return (item, msk), img
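
To make the layout change concrete, here is a small NumPy-only sketch of the two permutations used in this notebook. The tile shape is made up for illustration. It shows that `[0,4,1,2,3]` moves the channel axis to position 1 and that `[0,2,3,4,1]` undoes it.

```python
import numpy as np

# Hypothetical tile: batch of 1, 4 x 5 x 6 voxels, 2 channels (channels last).
img_channels_last = np.arange(1 * 4 * 5 * 6 * 2, dtype=np.float32).reshape(1, 4, 5, 6, 2)

to_channels_first = [0, 4, 1, 2, 3]  # N,H,W,D,C -> N,C,H,W,D
to_channels_last = [0, 2, 3, 4, 1]   # N,C,H,W,D -> N,H,W,D,C

img_channels_first = np.transpose(img_channels_last, to_channels_first)
round_trip = np.transpose(img_channels_first, to_channels_last)

print(img_channels_last.shape)   # (1, 4, 5, 6, 2)
print(img_channels_first.shape)  # (1, 2, 4, 5, 6)
print(np.array_equal(round_trip, img_channels_last))  # True
```

The same inverse permutation is what the plotting functions later apply to the OpenVINO&trade; output before comparing it with the channels-last mask.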

## Metric for OpenVINO&trade; Post-Training Optimization Toolkit API

Here we need to define the metric we will use to evaluate the OpenVINO&trade; model to determine how much it changes when the weights are converted from FP32 to INT8. In our case, the U-Net model uses the Dice coefficient to determine the accuracy of the model.

In [None]:
class MyMetric(Metric):
    def __init__(self):
        super().__init__()
        self.name = "custom Metric - Dice score"
        self._values = []
        self.round = 1

    @property
    def value(self):
        """ Returns accuracy metric value for the last model output. """
        return {self.name: [self._values[-1]]}

    @property
    def avg_value(self):
        """ Returns accuracy metric value for all model outputs. """
        value = np.ravel(self._values).mean()
        print("Round #{}    Mean {} = {}".format(self.round, self.name, value))

        self.round += 1

        return {self.name: value}

    def update(self, outputs, labels):
        """ Updates prediction matches.
        Args:
            outputs: model output
            labels: annotations
        Put your post-processing code here.
        Put your custom metric code here.
        The metric gets appended to the list of metric values
        """

        def dice_score(pred, truth):
            """
            Sorensen Dice score
            Measure of the overlap between the prediction and ground truth masks
            """
            numerator = np.sum(np.round(pred) * truth) * 2.0
            denominator = np.sum(np.round(pred)) + np.sum(truth)

            return numerator / denominator


        # dice_score expects (pred, truth): round the model output, not the label
        metric = dice_score(outputs[0], labels[0])
        self._values.append(metric)

    def reset(self):
        """ Resets collected matches """
        self._values = []

    @property
    def higher_better(self):
        """Attribute whether the metric should be increased"""
        return True

    def get_attributes(self):
        return {self.name: {"direction": "higher-better", "type": ""}}

## Additional Calibration Settings

In [None]:
engine_config = Dict({
    "device": "CPU",
    "stat_requests_number": 4,
    "eval_requests_number": 4
})

default_quantization_algorithm = [
    {
        "name": "DefaultQuantization",
        "params": {
            "target_device": "CPU",
            "preset": "performance",
            #"stat_subset_size": 10
        }
    }
]

accuracy_aware_quantization_algorithm = [
    {
        "name": "AccuracyAwareQuantization", # compression algorithm name
        "params": {
            "target_device": "CPU",
            "preset": "performance",
            "stat_subset_size": 10,
            "metric_subset_ratio": 0.5, # A part of the validation set that is used to compare full-precision and quantized models
            "ranking_subset_size": 300, # A size of a subset which is used to rank layers by their contribution to the accuracy drop
            "max_iter_num": 10,    # Maximum number of iterations of the algorithm (maximum of layers that may be reverted back to full-precision)
            "maximal_drop": maximum_metric_drop,      # Maximum metric drop which has to be achieved after the quantization
            "drop_type": "absolute",    # Drop type of the accuracy metric: relative or absolute (default)
            "use_prev_if_drop_increase": True,     # Whether to use NN snapshot from the previous algorithm iteration in case if drop increases
            "base_algorithm": "DefaultQuantization" # Base algorithm that is used to quantize model at the beginning
        }
    }
]

class GraphAttrs(object):
    def __init__(self):
        self.keep_quantize_ops_in_IR = True
        self.keep_shape_ops = False
        self.data_type = "FP32"
        self.progress = False
        self.generate_experimental_IR_V10 = True
        self.blobs_as_inputs = True
        self.generate_deprecated_IR_V7 = False
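
As a rough illustration of what `maximal_drop` and `drop_type` express, the sketch below computes an absolute and a relative drop for a higher-is-better metric such as the Dice score. The helper names `metric_drop` and `within_budget` are hypothetical, the scores are made up, and the assumption that this arithmetic mirrors the toolkit's internal check is ours, not something the toolkit documentation is quoted on here.

```python
def metric_drop(baseline, quantized, drop_type="absolute"):
    """Drop of a higher-is-better metric after quantization."""
    if drop_type == "absolute":
        return baseline - quantized
    if drop_type == "relative":
        return (baseline - quantized) / baseline
    raise ValueError("drop_type must be 'absolute' or 'relative'")


def within_budget(baseline, quantized, maximal_drop, drop_type="absolute"):
    """True if the drop does not exceed the allowed maximum."""
    return metric_drop(baseline, quantized, drop_type) <= maximal_drop


fp32_dice = 0.85          # Made-up FP32 score
maximum_metric_drop = 0.05

for int8_dice in (0.82, 0.78):  # Made-up INT8 scores
    for drop_type in ("absolute", "relative"):
        ok = within_budget(fp32_dice, int8_dice, maximum_metric_drop, drop_type)
        print("INT8 = {:.2f}, {} drop OK: {}".format(int8_dice, drop_type, ok))
```

With these made-up numbers, a score of 0.82 stays within the budget under both definitions, while 0.78 exceeds it under both.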

## Running the Post-training optimization Toolkit

Once we have the data loader and metric classes defined, we can pass them into the API pipeline. This calculates the scale (calibration) for the FP32 weights so that they can be converted to INT8.
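
To give a feel for what "calculating the scale" means, here is a deliberately simplified NumPy sketch of symmetric, per-tensor INT8 quantization. It is not the toolkit's algorithm. The functions `quantize_symmetric_int8` and `dequantize` are hypothetical, and the weights are random numbers rather than values from the 3D U-Net.

```python
import numpy as np


def quantize_symmetric_int8(x):
    """Map FP32 values to INT8 using a single scale derived from the largest magnitude."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(3, 3, 3)).astype(np.float32)  # Made-up weights

q_weights, scale = quantize_symmetric_int8(weights)
restored = dequantize(q_weights, scale)
max_error = float(np.max(np.abs(weights - restored)))

print("scale =", scale)
print("max abs round-trip error =", max_error)
```

The round-trip error is bounded by roughly half the scale, which is why a well-chosen scale matters. In the real pipeline, the calibration samples from the data loader are used to collect the statistics needed for this step.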

In [None]:
model = load_model(model_config)

data_loader = MyDataLoader(dataset_config)
metric = MyMetric()


engine = IEEngine(engine_config, data_loader, metric)

if accuracy_aware_quantization:
    # https://docs.openvinotoolkit.org/latest/_compression_algorithms_quantization_accuracy_aware_README.html
    print(bcolors.BOLD + "Accuracy-aware quantization method" + bcolors.ENDC)
    pipeline = create_pipeline(accuracy_aware_quantization_algorithm, engine)
else:
    print(bcolors.BOLD + "Default quantization method" + bcolors.ENDC)
    pipeline = create_pipeline(default_quantization_algorithm, engine)


metric_results_FP32 = pipeline.evaluate(model)

compressed_model = pipeline.run(model)
save_model(compressed_model, int8_directory)

metric_results_INT8 = pipeline.evaluate(compressed_model)

print(bcolors.BOLD + "\nFINAL RESULTS" + bcolors.ENDC)

# print metric value
if metric_results_FP32:
    for name, value in metric_results_FP32.items():
        print(bcolors.OKGREEN + "{: <27s} FP32: {}".format(name, value) + bcolors.ENDC)

if metric_results_INT8:
    for name, value in metric_results_INT8.items():
        print(bcolors.OKBLUE + "{: <27s} INT8: {}".format(name, value) + bcolors.ENDC)


print(bcolors.BOLD + "\nThe INT8 version of the model has been saved to the directory " + \
    bcolors.HEADER + "{}\n".format(int8_directory) + bcolors.ENDC)

## Load the FP32 OpenVINO&trade; model

In [None]:
path_to_xml_file = "{}.xml".format(openvino_path)
path_to_bin_file = "{}.bin".format(openvino_path)

ie = IECore()
net = ie.read_network(model=path_to_xml_file, weights=path_to_bin_file)


## Load the FP32 OpenVINO&trade; model to the hardware device

In this case our device is `CPU`. We could also use `MYRIAD` for the Intel&reg; NCS2&trade; VPU or `GPU` for the Intel&reg; GPU.

In [None]:
input_layer_name = next(iter(net.input_info))
output_layer_name = next(iter(net.outputs))
print("Input layer name = {}\nOutput layer name = {}".format(input_layer_name, output_layer_name))

exec_net = ie.load_network(network=net, device_name="CPU", num_requests=1)


## Load the INT8 OpenVINO&trade; model to the hardware device

In [None]:
openvino_filename_int8 = os.path.join(int8_directory, openvino_modelname)
path_to_xml_file_int8 = "{}.xml".format(openvino_filename_int8)
path_to_bin_file_int8 = "{}.bin".format(openvino_filename_int8)

ie_int8 = IECore()
net_int8 = ie_int8.read_network(model=path_to_xml_file_int8, weights=path_to_bin_file_int8)

input_layer_name = next(iter(net_int8.input_info))
output_layer_name = next(iter(net_int8.outputs))
print("Input layer name = {}\nOutput layer name = {}".format(input_layer_name, output_layer_name))

exec_net_int8 = ie_int8.load_network(network=net_int8, device_name="CPU", num_requests=1)


## Load the final TensorFlow model

In [None]:
import tensorflow as tf

tf_model = tf.keras.models.load_model("{}_final".format(settings.SAVED_MODEL_NAME), compile=False)
tf_model.compile(optimizer="adam", loss="binary_crossentropy")

In [None]:
def test_intel_tensorflow():
    """
    Check if Intel version of TensorFlow is installed
    """
    import tensorflow as tf

    print("We are using Tensorflow version {}".format(tf.__version__))

    major_version = int(tf.__version__.split(".")[0])
    if major_version >= 2:
        from tensorflow.python import _pywrap_util_port
        print("Intel-optimizations (DNNL) enabled:",
              _pywrap_util_port.IsMklEnabled())
    else:
        print("Intel-optimizations (DNNL) enabled:",
              tf.pywrap_tensorflow.IsMklEnabled())


test_intel_tensorflow()  # Prints if Intel-optimized TensorFlow is used.

## Calculate the Dice coefficient

This measures the performance of the model from 0 to 1 where 1 means the model gives a perfect prediction.

In [None]:
def calc_dice(target, prediction, smooth=0.0001):
    """
    Sorensen Dice coefficient
    Measure of the overlap between the ground truth and the (rounded) prediction
    """
    prediction = np.round(prediction)

    numerator = 2.0 * np.sum(target * prediction) + smooth
    denominator = np.sum(target) + np.sum(prediction) + smooth
    coef = numerator / denominator

    return coef
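
A small worked example may help here. The sketch below repeats the same formula on a made-up 2 x 4 mask so it can run on its own. The function name `dice` and the arrays are illustrative only.

```python
import numpy as np


def dice(target, prediction, smooth=0.0001):
    """Same formula as calc_dice: 2 * overlap / (sum of both masks), with smoothing."""
    prediction = np.round(prediction)
    numerator = 2.0 * np.sum(target * prediction) + smooth
    denominator = np.sum(target) + np.sum(prediction) + smooth
    return numerator / denominator


target = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0]], dtype=np.float32)

# Rounds to [[1, 1, 1, 0], [1, 0, 0, 0]]: 3 voxels overlap, 4 predicted, 4 true.
prediction = np.array([[0.9, 0.8, 0.7, 0.1],
                       [0.6, 0.2, 0.1, 0.0]], dtype=np.float32)

score = float(dice(target, prediction))
perfect = float(dice(target, target))
empty = float(dice(np.zeros_like(target), np.zeros_like(target)))

print("Partial overlap: {:.4f}".format(score))    # 0.7500
print("Perfect overlap: {:.4f}".format(perfect))  # 1.0000
print("Both masks empty: {:.4f}".format(empty))   # 1.0000
```

The last case shows why the `smooth` term is there: without it, two empty masks would give 0 / 0 instead of a score of 1.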

## Plot the predictions for both OpenVINO&trade; and TensorFlow

We'll also time the inference to compare.

In [None]:
def plot_predictions(img, msk):
    
    slicenum=np.argmax(np.sum(msk, axis=(1,2)))  # Find the slice with the largest tumor section

    plt.figure(figsize=(20,20))

    plt.subplot(1,4,1)
    plt.title("MRI", fontsize=20)
    plt.imshow(img[0,:,:,slicenum,0], cmap="bone")
    plt.subplot(1,4,2)
    plt.imshow(msk[0,:,:,slicenum,0], cmap="bone")
    plt.title("Ground truth", fontsize=20)

    """
    OpenVINO Model Prediction
    Note: OpenVINO assumes the input (and output) are organized as channels first (NCHWD)
    whereas TensorFlow assumes channels last (NHWDC). We'll use the NumPy transpose
    to change the order.
    """
    start_time = time.time()
    res = exec_net.infer({input_layer_name: np.transpose(img, [0,4,1,2,3])})
    prediction_ov = np.transpose(res[output_layer_name], [0,2,3,4,1])    
    print("OpenVINO inference time = {:.4f} msecs".format(1000.0*(time.time()-start_time)))

    plt.subplot(1,4,3)
    dice_coef_ov = calc_dice(msk,prediction_ov)
    plt.imshow(prediction_ov[0,:,:,slicenum,0], cmap="bone")
    plt.title("OpenVINO Prediction\nDice = {:.4f}".format(dice_coef_ov), fontsize=20)
    
    
    """
    TensorFlow Model Prediction
    """
    start_time = time.time()
    prediction_tf = tf_model.predict(img)
    print("TensorFlow inference time = {:.4f} msecs".format(1000.0*(time.time()-start_time)))
    
    plt.subplot(1,4,4)
    dice_coef_tf = calc_dice(msk,prediction_tf)
    plt.imshow(prediction_tf[0,:,:,slicenum,0], cmap="bone")
    plt.title("TensorFlow Prediction\nDice = {:.4f}".format(dice_coef_tf), fontsize=20)

## Inference time - TensorFlow versus FP32 OpenVINO&trade; 

Let's grab some data, perform inference, and plot the results.

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions(img,msk)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions(img,msk)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions(img,msk)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions(img,msk)

## Inference time - FP32 OpenVINO&trade; versus INT8 OpenVINO&trade;

Let's grab some data, perform inference, and plot the results.

In [None]:
def plot_predictions_int8_fp32(img, msk):
    
    slicenum=np.argmax(np.sum(msk, axis=(1,2)))  # Find the slice with the largest tumor section

    plt.figure(figsize=(20,20))

    plt.subplot(1,4,1)
    plt.title("MRI", fontsize=20)
    plt.imshow(img[0,:,:,slicenum,0], cmap="bone")
    plt.subplot(1,4,2)
    plt.imshow(msk[0,:,:,slicenum,0], cmap="bone")
    plt.title("Ground truth", fontsize=20)

    """
    OpenVINO Model Prediction - FP32
    Note: OpenVINO assumes the input (and output) are organized as channels first (NCHWD)
    whereas TensorFlow assumes channels last (NHWDC). We'll use the NumPy transpose
    to change the order.
    """
    start_time = time.time()
    res = exec_net.infer({input_layer_name: np.transpose(img, [0,4,1,2,3])})
    prediction_ov = np.transpose(res[output_layer_name], [0,2,3,4,1])    
    print("OpenVINO inference time FP32 = {:.4f} msecs".format(1000.0*(time.time()-start_time)))

    plt.subplot(1,4,3)
    dice_coef_ov = calc_dice(msk,prediction_ov)
    plt.imshow(prediction_ov[0,:,:,slicenum,0], cmap="bone")
    plt.title("OpenVINO FP32 Prediction\nDice = {:.4f}".format(dice_coef_ov), fontsize=20)
    
    
    """
    OpenVINO Model Prediction - INT8
    Note: OpenVINO assumes the input (and output) are organized as channels first (NCHWD)
    whereas TensorFlow assumes channels last (NHWDC). We'll use the NumPy transpose
    to change the order.
    """
    start_time = time.time()
    res_int8 = exec_net_int8.infer({input_layer_name: np.transpose(img, [0,4,1,2,3])})
    prediction_ov_int8 = np.transpose(res_int8[output_layer_name], [0,2,3,4,1])    
    print("OpenVINO inference time INT8 = {:.4f} msecs".format(1000.0*(time.time()-start_time)))

    plt.subplot(1,4,4)
    dice_coef_ov_int8 = calc_dice(msk,prediction_ov_int8)
    plt.imshow(prediction_ov_int8[0,:,:,slicenum,0], cmap="bone")
    plt.title("OpenVINO INT8 Prediction\nDice = {:.4f}".format(dice_coef_ov_int8), fontsize=20)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions_int8_fp32(img,msk)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions_int8_fp32(img, msk)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions_int8_fp32(img, msk)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions_int8_fp32(img, msk)

In [None]:
ds = brats_data.get_test().take(1).as_numpy_iterator()
for img, msk in ds:
    plot_predictions_int8_fp32(img, msk)

*Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. SPDX-License-Identifier: EPL-2.0*

*Copyright (c) 2019-2020 Intel Corporation*