<a href="https://colab.research.google.com/github/kaka-lin/ML-Notes/blob/master/Model%20optimization/Post%20Training%20Quantization/Dynamic%20Range%20Quantization/example/dynamic_range_quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Post Training Quantization - full integer quantization

This example uses `MNIST`. The results are as follows:

```bash
Model Size:
{'baselin model (TF2)': 11149205,
 'non quantized tflite': 11084920,
 'ptq dynamic tflite': 2774512,
 'ptq float fullback tflite': 2775032,
 'ptq int only tflite': 2775056}
Model Accuracy:
{'baseline model': 0.9815,
 'non quantized tflite': 0.9815,
 'ptq dynamic tflite': 0.9815,
 'ptq float fullback tflite': 0.9812,
 'ptq int only tflite': 0.9814}
```
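
To read these numbers at a glance, the sizes can be turned into compression ratios relative to the non-quantized TFLite model. The snippet below is only a convenience sketch: the dictionary is copied from the output above, and the helper name `compression_ratios` is an assumption, not part of the notebook.

```python
# Sizes in bytes, copied from the results above.
model_size = {
    'non quantized tflite': 11084920,
    'ptq dynamic tflite': 2774512,
    'ptq float fullback tflite': 2775032,
    'ptq int only tflite': 2775056,
}


def compression_ratios(sizes, reference='non quantized tflite'):
    """Return how many times smaller each model is than the reference."""
    ref = sizes[reference]
    return {name: ref / size for name, size in sizes.items()}


ratios = compression_ratios(model_size)
for name, ratio in ratios.items():
    print(f"{name:<28} {model_size[name] / 1e6:6.2f} MB  {ratio:4.2f}x smaller")
```

All three quantized variants come out just under 4x smaller, which is what one would expect when 32-bit weights are stored in 8 bits.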

The detailed steps follow.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import pprint
pp = pprint.PrettyPrinter(depth=4)

from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Flatten, Conv2D

%matplotlib inline

print("TensorFlow version: ", tf.__version__)

TensorFlow version:  2.9.1


## Generate a TensorFlow Model

This section covers loading the MNIST dataset, building the model, training, and testing.

### Setup

Define the helper functions used below, including `load data` and `build, train and test model`.
To skip the details, jump straight to [Starting Generate model](#Starting-Generate-model).

In [2]:
def load_data():
    mnist_dataset = tf.keras.datasets.mnist.load_data()
    (x_train, y_train), (x_test, y_test) = mnist_dataset
    
    # Normalize the input image so that each pixel value is between 0 to 1.
    x_train = x_train.astype(np.float32) / 255.0
    x_test = x_test.astype(np.float32) / 255.0

    # Add a channels dimension
    x_train = x_train[..., tf.newaxis]
    x_test = x_test[..., tf.newaxis]
    
    return (x_train, y_train), (x_test, y_test)


def calculate_tf2_model_size(model_name):
    """Calculate the on-disk size (in bytes) of a TensorFlow SavedModel."""
    size = 0
    
    # The TensorFlow SavedModel format is a folder, so sum the size of every file in it
    for path, dirs, files in os.walk(model_name):
        for f in files:
            fp = os.path.join(path, f)
            size += os.path.getsize(fp)
    return size

In [3]:
class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        
        # Define your layer here
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.flatten = Flatten()
        self.dense1 = Dense(128, activation='relu')
        self.dense2 = Dense(10, activation='softmax')
    
    def call(self, x):
        # Define your forward pass here
        x = self.conv1(x)
        x = self.flatten(x)
        x = self.dense1(x)
        x = self.dense2(x)
        return x

@tf.function
def train_step(model, x_batch, y_batch, optimizer, loss_fn,
               train_loss, train_accuracy):
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)
        loss = loss_fn(y_batch, predictions)

    # backward
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    # Update training metric after batch
    train_loss.update_state(loss)
    train_accuracy.update_state(y_batch, predictions)


@tf.function
def test_step(model, x_batch, y_batch, loss_fn,
              test_loss, test_accuracy):
    # forward
    predictions = model(x_batch, training=False)
    t_loss = loss_fn(y_batch, predictions)

    test_loss.update_state(t_loss)
    test_accuracy.update_state(y_batch, predictions)

### Build the model

Note: the optimizer must be instantiated only once here; otherwise the following error is raised:

```bash
ValueError: tf.function only supports singleton tf.Variables created on the first call. Make sure the tf.Variable is only created once or created outside tf.function. See https://www.tensorflow.org/guide/function#creating_tfvariables for more information.
```

Reference: [Using with multiple Keras optimizers](https://www.tensorflow.org/guide/function#using_with_multiple_keras_optimizers)

In [4]:
# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = load_data()
train_images, train_labels = x_train, y_train
test_images, test_labels = x_test, y_test

# Preprocessing dataset
# Use `tf.data` to batch and shuffle the dataset:
train_dataset = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(10000).batch(32)

test_dataset = tf.data.Dataset.from_tensor_slices(
    (x_test, y_test)).batch(32)

# Build the model
model = MyModel()

# Build the model with a concrete input shape so that model.summary() can be used
model.build(input_shape=(None, 28, 28, 1))
model.call(Input(shape=(28, 28, 1)))
# model.summary()

# Compile the model: optimizer and loss
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
test_loss = tf.keras.metrics.Mean(name='test_loss')
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')

# Dicts for recording each model's size and accuracy
MODEL_SIZE = {}
MODEL_ACCURACY = {}

### Starting Generate model

This training won't take long because the model is trained for just 3 epochs, which reaches about 98% accuracy.
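
The progress bars below show 1875 training steps and 313 test steps per pass. As a quick sanity check, those counts follow from the batch size of 32, assuming the standard MNIST split of 60,000 training and 10,000 test images (the split sizes are not printed in this notebook).

```python
import math

BATCH_SIZE = 32
N_TRAIN, N_TEST = 60_000, 10_000  # assumed standard MNIST split sizes

# The last batch may be partial, hence the ceiling.
train_batches = math.ceil(N_TRAIN / BATCH_SIZE)
test_batches = math.ceil(N_TEST / BATCH_SIZE)
print(train_batches, test_batches)
```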

In [5]:
#########################################################################################################
# Start training
EPOCHS = 3
n_batches = len(train_dataset)
for epoch in range(EPOCHS):
    print(f'Epoch {epoch+1}/{EPOCHS}')
    with tqdm(train_dataset, total=n_batches,
              bar_format='{desc:<5.5}{percentage:3.0f}%|{bar:30}{r_bar}') as pbar:
        for batch, (x_train, y_train) in enumerate(pbar):
            train_step(model, x_train, y_train, optimizer, loss_fn,
                       train_loss, train_accuracy)
            pbar.set_postfix({
                'loss': train_loss.result().numpy(),
                'accuracy': train_accuracy.result().numpy()})

    # Reset the metrics at the start of the next epoch
    train_loss.reset_states()
    train_accuracy.reset_states()

# save model
model.save('model_data/baseline_model', include_optimizer=False)

#########################################################################################################
# Testing
n_batches = len(test_dataset)
print(f'Testing:')
with tqdm(test_dataset, total=n_batches,
          bar_format='{desc:<5.5}{percentage:3.0f}%|{bar:30}{r_bar}') as pbar:
    for batch, (x_test, y_test) in enumerate(pbar):
        test_step(model, x_test, y_test, loss_fn, test_loss, test_accuracy)
        pbar.set_postfix({
            'loss': test_loss.result().numpy(),
            'accuracy': test_accuracy.result().numpy()})
    
test_acc = test_accuracy.result().numpy()
    
# Reset the test metrics now that evaluation is finished
test_loss.reset_states()
test_accuracy.reset_states()
print(f"Test accuracy: {test_acc}")

#########################################################################################################
# Record model size and accuracy
MODEL_SIZE['baselin model (TF2)'] = calculate_tf2_model_size('model_data/baseline_model') # / 1000000 MB
MODEL_ACCURACY['baseline model'] = test_acc

print("Model Size:")
pp.pprint(MODEL_SIZE)
print("Model Accuracy:")
pp.pprint(MODEL_ACCURACY)

Epoch 1/3


     100%|██████████████████████████████| 1875/1875 [00:05<00:00, 351.98it/s, loss=0.134, accuracy=0.96] 


Epoch 2/3


     100%|██████████████████████████████| 1875/1875 [00:03<00:00, 478.04it/s, loss=0.0434, accuracy=0.987]


Epoch 3/3


     100%|██████████████████████████████| 1875/1875 [00:03<00:00, 478.02it/s, loss=0.0218, accuracy=0.993]


INFO:tensorflow:Assets written to: model_data/baseline_model/assets


Testing:


     100%|██████████████████████████████| 313/313 [00:00<00:00, 464.71it/s, loss=0.0618, accuracy=0.982]


Test accuracy: 0.9815000295639038
Model Size:
{'baselin model (TF2)': 11149205}
Model Accuracy:
{'baseline model': 0.9815}


## Convert to a TFLite model

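Note that some forms of quantization leave part of the data in floating-point format. The following sections therefore walk through each option in order of increasing quantization, until we obtain a model made up entirely of `int8` or `uint8` data.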

### 1. No quantization

In [6]:
# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model('model_data/baseline_model')                                          
tflite_model = converter.convert()

# Save the model.
with open('model_data/model_no_quant.tflite', 'wb') as f:
    f.write(tflite_model)

It's now a TensorFlow Lite model, but it's still using 32-bit float values for all parameter data.
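
A back-of-the-envelope check makes the float32 claim concrete. Assuming the `MyModel` architecture defined earlier (a 3x3 `Conv2D` with 32 filters and the default 'valid' padding, then `Dense(128)` and `Dense(10)`) on 28x28x1 inputs, the parameter count times 4 bytes should land close to the reported file size of 11,084,920 bytes. The helper names below are illustrative only.

```python
def conv2d_params(kernel, in_ch, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_ch * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


# 28x28 input, 3x3 kernel, 'valid' padding -> 26x26 feature map per filter.
flat_units = 26 * 26 * 32

total_params = (
    conv2d_params(3, 1, 32)
    + dense_params(flat_units, 128)
    + dense_params(128, 10)
)
float32_bytes = total_params * 4  # 4 bytes per float32 parameter

print(f"parameters: {total_params:,}")
print(f"float32 payload: {float32_bytes:,} bytes")
```

The estimate is within a few kilobytes of the actual file; the remainder would be the graph structure and metadata stored alongside the weights.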

### 2. Dynamic range quantization

In [7]:
# Convert the model: Dynamic range quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model_data/baseline_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# Save the model.
with open('model_data/model_ptq_dynamic.tflite', 'wb') as f:
    f.write(tflite_quant_model)

The model is now roughly 4x smaller (2,774,512 bytes versus 11,084,920 bytes in the results above) because the weights are quantized to 8 bits, but other variable data is still in float format.
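
To illustrate what "quantized weights" means, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is a simplification for intuition, not the converter's actual implementation; the function names and the sample weights are made up.

```python
def quantize_weights(weights):
    """Map floats to integers in [-127, 127] using a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_weights(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # made-up example values
q, scale = quantize_weights(weights)
restored = dequantize_weights(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight now fits in one byte instead of four, and the round-trip error of any in-range value is at most half the scale.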

### 3. Integer with float fallback quantization (using default float input/output)

To quantize the variable data (such as model `input/output` and `intermediates between layers`), you need to provide a `RepresentativeDataset`. 

###### RepresentativeDataset

This is a generator function that provides a set of input data that's large enough to represent typical values. It allows the converter to estimate a dynamic range for all the variable data. (The dataset does not need to be unique compared to the training or evaluation dataset.) To support multiple inputs, each representative data point is a list and elements in the list are fed to the model according to their indices.
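
As a rough illustration of what "estimating a dynamic range" involves, the sketch below derives a scale and zero point for 8-bit unsigned quantization from the minimum and maximum seen across a few samples, using the affine mapping `real = scale * (q - zero_point)`. This is a simplified min/max calibration written for intuition; the converter's internal calibration may differ, and the function names and sample values are hypothetical.

```python
def calibrate_uint8(samples):
    """Derive (scale, zero_point) from the observed min/max of the samples."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    # The representable range must include 0.0 so that zero maps exactly.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Float -> integer in [0, 255], clamped."""
    return max(0, min(255, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Integer -> approximate float."""
    return scale * (q - zero_point)


# Made-up activations standing in for a representative dataset.
samples = [[-1.0, 0.2, 0.7], [0.0, 1.5, 3.0], [-0.5, 2.2, 0.1]]
scale, zero_point = calibrate_uint8(samples)

for x in (-1.0, 0.0, 3.0):
    q = quantize(x, scale, zero_point)
    print(x, "->", q, "->", dequantize(q, scale, zero_point))
```

If the samples do not cover the values seen at inference time, anything outside the calibrated range is clamped, which is why the representative data should reflect typical inputs.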

In [8]:
def representative_data_gen():    
    for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
        # Model has only one input so each data point has one element.
        yield [input_value]

# Convert the model: Integer with float fallback quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model_data/baseline_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
tflite_quant_model = converter.convert()

# Save the model.
with open('model_data/model_ptq_int_float_fullback.tflite', 'wb') as f:
    f.write(tflite_quant_model)

fully_quantize: 0, inference_type: 6, input_inference_type: 0, output_inference_type: 0


Now all weights and variable data are quantized, and the model is significantly smaller compared to the original TensorFlow Lite model.

However, to maintain compatibility with applications that traditionally use float model input and output tensors, the TensorFlow Lite Converter leaves the model input and output tensors in float:

In [9]:
# Load the model into the interpreter
interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
input_type = interpreter.get_input_details()[0]['dtype']
print('input: ', input_type)
output_type = interpreter.get_output_details()[0]['dtype']
print('output: ', output_type)

input:  <class 'numpy.float32'>
output:  <class 'numpy.float32'>


This mode is usually good for compatibility, but it is not compatible with devices that perform only integer-based operations, such as the `Edge TPU`.

Additionally, the above process may leave an operation in float format if TensorFlow Lite doesn't include a quantized implementation for that operation. This strategy allows conversion to complete so you have a smaller and more efficient model, but again, it won't be compatible with integer-only hardware. (All ops in this MNIST model have a quantized implementation.)

So to ensure an end-to-end integer-only model, you need a couple more parameters...

### 4. Integer-only quantization

To quantize the input and output tensors, and make the converter throw an error if it encounters an operation it cannot quantize, convert the model again with some additional parameters:

In [10]:
def representative_data_gen():    
    for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
        # Model has only one input so each data point has one element.
        yield [input_value]

# Convert the model: Integer-only quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model_data/baseline_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to uint8 (APIs added in r2.3)
converter.inference_input_type = tf.uint8  # or tf.int8
converter.inference_output_type = tf.uint8  # or tf.int8
tflite_quant_model = converter.convert()

# Save the model.
with open('model_data/model_ptq_int_only.tflite', 'wb') as f:
    f.write(tflite_quant_model)

fully_quantize: 0, inference_type: 6, input_inference_type: 3, output_inference_type: 3


The internal quantization remains the same as above, but you can see the input and output tensors are now integer format:

In [11]:
# Load the model into the interpreter
interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
input_type = interpreter.get_input_details()[0]['dtype']
print('input: ', input_type)
output_type = interpreter.get_output_details()[0]['dtype']
print('output: ', output_type)

input:  <class 'numpy.uint8'>
output:  <class 'numpy.uint8'>


Now you have an integer quantized model that uses integer data for the model's input and output tensors, so it's compatible with integer-only hardware such as the Edge TPU.

## Run the TFLite model

### Helper function for running tflite model

First, create a function that evaluates the accuracy of a TFLite model. See the [official example](https://www.tensorflow.org/lite/performance/post_training_integer_quant_16x8#evaluate_the_models).

In [12]:
# Helper function to run inference on a TFLite model
def evaluate_model(tflite_file):
    # Initialize the interpreter
    interpreter = tf.lite.Interpreter(model_path=str(tflite_file))
    interpreter.allocate_tensors()
    
    input_details = interpreter.get_input_details()[0]
    output_details = interpreter.get_output_details()[0]
    
    # Run predictions on every image in the "test" dataset.
    prediction_digits = []
    for test_image in test_images:
        
        # Check if the input type is quantized, then rescale input data to uint8
        if input_details['dtype'] == np.uint8:
            input_scale, input_zero_point = input_details["quantization"]
            test_image = test_image / input_scale + input_zero_point

        # Pre-processing: add batch dimension and cast to the dtype the model
        # expects (float32 or uint8).
        test_image = np.expand_dims(test_image, axis=0).astype(input_details["dtype"])
        interpreter.set_tensor(input_details["index"], test_image)
        
        # Run inference.
        interpreter.invoke()
        
        # Post-processing: remove batch dimension and find the digit with highest
        # probability.
        output = interpreter.tensor(output_details["index"])
        digit = np.argmax(output()[0])
        prediction_digits.append(digit)
    
    # Compare prediction results with ground truth labels to calculate accuracy.
    accurate_count = 0
    for index in range(len(prediction_digits)):
        if prediction_digits[index] == test_labels[index]:
            accurate_count += 1
    accuracy = accurate_count * 1.0 / len(prediction_digits)

    return accuracy
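
Note that `evaluate_model` takes `np.argmax` directly on the raw output tensor, even when that tensor is `uint8`. This works because dequantization, `real = scale * (q - zero_point)` with a positive scale, is monotonic, so the position of the largest value does not change. A small plain-Python sketch, with made-up output values and quantization parameters:

```python
def dequantize(q_values, scale, zero_point):
    """Convert quantized integers back to approximate real values."""
    return [scale * (q - zero_point) for q in q_values]


def argmax(values):
    """Index of the largest element (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical uint8 output for the 10 digit classes.
q_output = [0, 2, 1, 0, 0, 3, 0, 249, 0, 1]
scale, zero_point = 1 / 256, 0  # assumed values, for illustration only

probs = dequantize(q_output, scale, zero_point)
print("predicted digit (quantized):  ", argmax(q_output))
print("predicted digit (dequantized):", argmax(probs))
```

If real-valued probabilities are needed rather than just the predicted class, the same dequantization step can be applied using the scale and zero point reported in `output_details["quantization"]`.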

In [13]:
# evaluate the TF Lite model
no_quant_acc = evaluate_model('model_data/model_no_quant.tflite')
ptq_dynamic_acc = evaluate_model('model_data/model_ptq_dynamic.tflite')
ptq_int_float_fullback_acc = evaluate_model('model_data/model_ptq_int_float_fullback.tflite')
ptq_int_only_acc = evaluate_model('model_data/model_ptq_int_only.tflite')

# Record model size and accuracy
MODEL_SIZE['non quantized tflite'] = os.path.getsize('model_data/model_no_quant.tflite') # / 1000000 MB
MODEL_ACCURACY['non quantized tflite'] = no_quant_acc

MODEL_SIZE['ptq dynamic tflite'] = os.path.getsize('model_data/model_ptq_dynamic.tflite') # / 1000000 MB
MODEL_ACCURACY['ptq dynamic tflite'] = ptq_dynamic_acc

MODEL_SIZE['ptq float fullback tflite'] = os.path.getsize('model_data/model_ptq_int_float_fullback.tflite') # / 1000000 MB
MODEL_ACCURACY['ptq float fullback tflite'] = ptq_int_float_fullback_acc

MODEL_SIZE['ptq int only tflite'] = os.path.getsize('model_data/model_ptq_int_only.tflite') # / 1000000 MB
MODEL_ACCURACY['ptq int only tflite'] = ptq_int_only_acc

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


## Results

In [14]:
print("Model Size:")
pp.pprint(MODEL_SIZE)
print("Model Accuracy:")
pp.pprint(MODEL_ACCURACY)

Model Size:
{'baselin model (TF2)': 11149205,
 'non quantized tflite': 11084920,
 'ptq dynamic tflite': 2774512,
 'ptq float fullback tflite': 2775032,
 'ptq int only tflite': 2775056}
Model Accuracy:
{'baseline model': 0.9815,
 'non quantized tflite': 0.9815,
 'ptq dynamic tflite': 0.9815,
 'ptq float fullback tflite': 0.9812,
 'ptq int only tflite': 0.9814}
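
The accuracy differences are easier to judge as counts of images. Assuming the standard 10,000-image MNIST test set (consistent with the 313 test batches of 32 reported earlier), the sketch below converts each accuracy gap into a net number of predictions; the dictionary is copied from the output above.

```python
N_TEST = 10_000  # assumed size of the MNIST test set

model_accuracy = {
    'baseline model': 0.9815,
    'non quantized tflite': 0.9815,
    'ptq dynamic tflite': 0.9815,
    'ptq float fullback tflite': 0.9812,
    'ptq int only tflite': 0.9814,
}

baseline_correct = round(model_accuracy['baseline model'] * N_TEST)
extra_errors = {
    name: baseline_correct - round(acc * N_TEST)
    for name, acc in model_accuracy.items()
}

for name, diff in extra_errors.items():
    print(f"{name:<28} {diff} more misclassified image(s) than the baseline")
```

Under that assumption, the largest gap in this run amounts to only a handful of images, so quantization costs very little accuracy on this model.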
