## Overview

Quantization is an optimization strategy that converts 32-bit floating-point numbers (such as weights and activation outputs) to lower-precision numbers (int8, float16, etc.). This results in a smaller model and faster inference, which is valuable for low-power devices such as [microcontrollers](https://www.tensorflow.org/lite/microcontrollers). Integer data formats are also required by integer-only accelerators such as the [Edge TPU](https://coral.ai/).
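
To make the idea concrete, here is a minimal pure-Python sketch of affine (scale plus zero point) quantization to int8. The helper names `quantize_int8` and `dequantize` are hypothetical and exist only for illustration; the real converter's scheme is more involved.

```python
def quantize_int8(values):
    """Map floats to int8 using an affine (scale + zero point) scheme."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(0.0, min(values)), max(0.0, max(values))
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return [scale * (x - zero_point) for x in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each value now needs one byte instead of four, at the cost of a rounding error on the order of one quantization step (`scale`).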

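In this tutorial, you'll train an MNIST model from scratch, convert it into a TensorFlow Lite file, and quantize it. Finally, you'll check the accuracy of the converted models and compare them to the original float model.

You have several options for how aggressively to quantize a model. This tutorial covers four techniques: post-training float16 quantization, post-training dynamic range quantization, post-training integer quantization, and quantization aware training.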


## Setup

To quantize a model, we need APIs added in TensorFlow r2.3:

In [124]:
import logging
logging.getLogger("tensorflow").setLevel(logging.DEBUG)

import tensorflow as tf
import numpy as np
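# Compare (major, minor) as integers so versions such as 2.10 parse correctly
assert tuple(int(part) for part in tf.__version__.split(".")[:2]) >= (2, 3)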

## Generate a TensorFlow Model

We'll build a simple model to classify numbers from the [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist).

Training won't take long because you're training the model for just 5 epochs, which reaches about 98% accuracy.

In [125]:
# Load MNIST dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input images so that each pixel value is between 0 and 1.
train_images = train_images.astype(np.float32) / 255.0
test_images = test_images.astype(np.float32) / 255.0

# Define the model architecture
model = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(28, 28)),
  tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
  tf.keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(
                  from_logits=True),
              metrics=['accuracy'])
model.fit(
  train_images,
  train_labels,
  epochs=5,
  validation_data=(test_images, test_labels)
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f753e418bd0>

## Convert to a TensorFlow Lite model

You can convert the trained model to TensorFlow Lite format using the [`TFLiteConverter`](https://www.tensorflow.org/lite/convert/python_api) API, and apply varying degrees of quantization.

Beware that some versions of quantization leave some of the data in float format. So the following sections show each option with increasing amounts of quantization, until we get a model that's entirely int8 or uint8 data. (Notice we duplicate some code in each section so you can see all the quantization steps for each option.)

First, here's a converted model with no quantization:

In [126]:
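import pathlib
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
tflite_models_dir = pathlib.Path(".")  # directory for all converted models (assumed: current directory)
tflite_model_file = tflite_models_dir/"non-quant_model.tflite"
tflite_model_file.write_bytes(tflite_model)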

INFO:tensorflow:Assets written to: /tmp/tmpfx1of4z5/assets




84576

In [127]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
input_type = interpreter.get_input_details()[0]['dtype']
print('input type: ', input_type)
output_type = interpreter.get_output_details()[0]['dtype']
print('output type: ', output_type)

input type:  <class 'numpy.float32'>
output type:  <class 'numpy.float32'>


It's now a TensorFlow Lite model, but it's still using 32-bit float values for all parameter data.

### **Technique-1: Post-training float16 quantization**

Weights are converted to 16-bit floating-point values during model conversion from TensorFlow to TensorFlow Lite's flat buffer format. This results in a 2x reduction in model size. Some hardware, like GPUs, can compute natively in this reduced-precision arithmetic, realizing a speedup over traditional floating-point execution. The TensorFlow Lite GPU delegate can be configured to run in this way. However, a model converted to float16 weights can still run on the CPU without additional modification: the float16 weights are upcast to float32 prior to the first inference. This permits a significant reduction in model size in exchange for a minimal impact on latency and accuracy.
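
The size and precision trade-off can be illustrated with the standard library alone. This sketch packs a few made-up weight values as float32 and as float16 using Python's `struct` module; it only demonstrates the number format, not how TensorFlow Lite stores tensors.

```python
import struct

# Hypothetical weight values, chosen only for illustration.
weights = [0.1, -1.5, 3.14159, 1e-3]

# Pack the same values as float32 ("f") and as float16 ("e").
as_f32 = struct.pack(f"<{len(weights)}f", *weights)
as_f16 = struct.pack(f"<{len(weights)}e", *weights)
print(len(as_f32), len(as_f16))  # 16 bytes versus 8 bytes

# Reading the float16 bytes back yields slightly different values.
restored = struct.unpack(f"<{len(weights)}e", as_f16)
rel_errors = [abs(a - b) / abs(a) for a, b in zip(weights, restored)]
print(max(rel_errors))
```

The byte count halves, and the relative error per value stays below roughly one part in a thousand for values in float16's normal range.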

To quantize the model to float16 on export, first set the optimizations flag to use default optimizations. Then specify that float16 is the supported type on the target platform:

In [128]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()
tflite_model_fp16_file = tflite_models_dir/"post-training_f16_quant.tflite"
model_size = tflite_model_fp16_file.write_bytes(tflite_fp16_model)/1000
print("Model size after quantization: %s Kb"%model_size)

INFO:tensorflow:Assets written to: /tmp/tmp8dfd2u30/assets




Model size after quantization: 44.528 Kb


In [129]:
interpreter_fp16 = tf.lite.Interpreter(model_path=str(tflite_model_fp16_file))
interpreter_fp16.allocate_tensors()
input_type = interpreter_fp16.get_input_details()[0]['dtype']
print('input type: ', input_type)
output_type = interpreter_fp16.get_output_details()[0]['dtype']
print('output type: ', output_type)

input type:  <class 'numpy.float32'>
output type:  <class 'numpy.float32'>


### **Technique-2: Post-training dynamic range quantization**


Now let's enable the default `optimizations` flag to quantize all fixed parameters (such as weights). The activations are always stored in floating point. For ops that support quantized kernels, the activations are quantized to 8 bits of precision dynamically prior to processing and are de-quantized to float precision after processing. Depending on the model being converted, this can give a speedup over pure floating point computation.
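
The quantize, compute, de-quantize cycle described above can be sketched for a single dot product. This is a simplified illustration that assumes symmetric int8 quantization; `symmetric_quantize` and `dynamic_range_dot` are hypothetical names, and the actual TensorFlow Lite kernels are more sophisticated.

```python
def symmetric_quantize(values):
    """Quantize floats to the int8 range with a symmetric scale (zero point 0)."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

# Weights are fixed, so they are quantized once, at conversion time.
weights = [0.5, -0.25, 0.75, 0.1]
q_weights, w_scale = symmetric_quantize(weights)

def dynamic_range_dot(activations):
    # Activations arrive as floats and are quantized on the fly for each call.
    q_acts, a_scale = symmetric_quantize(activations)
    acc = sum(a * w for a, w in zip(q_acts, q_weights))  # integer accumulation
    return acc * a_scale * w_scale  # de-quantize the result back to float

activations = [1.0, 2.0, -1.5, 0.3]
exact = sum(a * w for a, w in zip(activations, weights))
approx = dynamic_range_dot(activations)
print(exact, approx)
```

Because the activation scale is recomputed from the data seen at run time, no calibration dataset is needed for this technique.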

In [130]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

dynamic_quant = converter.convert()
dynamic_quant_file = tflite_models_dir/"post-training_dynamic_quant.tflite"
model_size = dynamic_quant_file.write_bytes(dynamic_quant)/1000
print("Model size after quantization: %s Kb"%model_size)

INFO:tensorflow:Assets written to: /tmp/tmp7zg8yi40/assets




Model size after quantization: 23.984 Kb


In [131]:
interpreter_quant = tf.lite.Interpreter(model_path=str(dynamic_quant_file))
interpreter_quant.allocate_tensors()
input_type = interpreter_quant.get_input_details()[0]['dtype']
print('input: ', input_type)
output_type = interpreter_quant.get_output_details()[0]['dtype']
print('output: ', output_type)

input:  <class 'numpy.float32'>
output:  <class 'numpy.float32'>


The model is now about 3.5x smaller than the float model (roughly 24 KB versus 85 KB) because the weights are quantized, but other variable data is still in float format.

### **Technique-3: Post-training integer quantization**

Integer quantization is an optimization strategy that converts 32-bit floating-point numbers (such as weights and activation outputs) to the nearest 8-bit fixed-point numbers.

1) **Convert using float fallback quantization**

To quantize the variable data (such as model input/output and intermediates between layers), you need to provide a [`RepresentativeDataset`](https://www.tensorflow.org/api_docs/python/tf/lite/RepresentativeDataset). This is a generator function that provides a set of input data that's large enough to represent typical values. It allows the converter to estimate a dynamic range for all the variable data. (The dataset does not need to be unique compared to the training or evaluation dataset.)
To support multiple inputs, each representative data point is a list and elements in the list are fed to the model according to their indices.
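
As a rough picture of what the converter does with these samples, the sketch below tracks the minimum and maximum values across a tiny, made-up calibration set and derives int8 quantization parameters from that range. `calibrate_int8` is a hypothetical helper; the converter's real calibration logic is more elaborate and runs per tensor.

```python
def calibrate_int8(samples):
    """Derive an int8 scale and zero point from representative samples."""
    # Observed range across every sample, widened to include 0.0 so that
    # zero is exactly representable.
    lo = min(0.0, min(min(s) for s in samples))
    hi = max(0.0, max(max(s) for s in samples))
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-128 - lo / scale)
    return scale, zero_point

# Hypothetical calibration set: three flattened "images" with pixels in [0, 1].
representative_samples = [
    [0.0, 0.2, 0.9],
    [0.1, 1.0, 0.4],
    [0.0, 0.5, 0.7],
]
scale, zero_point = calibrate_int8(representative_samples)
print(scale, zero_point)
```

If the samples miss part of the real input range, values outside it are clipped at inference time, which is why the dataset should cover typical inputs.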


In [132]:
def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
    # Model has only one input so each data point has one element.
    yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen

tflite_quant_float = converter.convert()
tflite_model_quant_float_file = tflite_models_dir/"quant_float.tflite"
model_size = tflite_model_quant_float_file.write_bytes(tflite_quant_float)/1000
print("Model size after quantization: %s Kb"%model_size)

INFO:tensorflow:Assets written to: /tmp/tmpmeun8087/assets




Model size after quantization: 24.296 Kb


Now all weights and variable data are quantized, and the model is significantly smaller compared to the original TensorFlow Lite model.

However, to maintain compatibility with applications that traditionally use float model input and output tensors, the TensorFlow Lite Converter leaves the model input and output tensors in float:

In [133]:
interpreter_quant_float = tf.lite.Interpreter(model_path=str(tflite_model_quant_float_file))
interpreter_quant_float.allocate_tensors()
input_type = interpreter_quant_float.get_input_details()[0]['dtype']
print('input: ', input_type)
output_type = interpreter_quant_float.get_output_details()[0]['dtype']
print('output: ', output_type)

input:  <class 'numpy.float32'>
output:  <class 'numpy.float32'>


That's usually good for compatibility, but it won't be compatible with devices that perform only integer-based operations, such as the Edge TPU.

Additionally, the above process may leave an operation in float format if TensorFlow Lite doesn't include a quantized implementation for that operation. This strategy allows conversion to complete so you have a smaller and more efficient model, but again, it won't be compatible with integer-only hardware. (All ops in this MNIST model have a quantized implementation.)

So to ensure an end-to-end integer-only model, you need a couple more parameters...

2) **Convert using integer-only quantization**



To quantize the input and output tensors, and make the converter throw an error if it encounters an operation it cannot quantize, convert the model again with some additional parameters:

In [134]:
def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
    yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to uint8 (APIs added in r2.3)
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_quant_int = converter.convert()
tflite_model_quant_int_file = tflite_models_dir/"quant_int.tflite"
model_size = tflite_model_quant_int_file.write_bytes(tflite_quant_int)/1000
print("Model size after quantization: %s Kb"%model_size)

INFO:tensorflow:Assets written to: /tmp/tmpvdu7nbpw/assets




Model size after quantization: 24.352 Kb


The internal quantization remains the same as above, but you can see the input and output tensors are now integer format:


In [135]:
interpreter_quant_int = tf.lite.Interpreter(model_path=str(tflite_model_quant_int_file))
interpreter_quant_int.allocate_tensors()
input_type = interpreter_quant_int.get_input_details()[0]['dtype']
print('input: ', input_type)
output_type = interpreter_quant_int.get_output_details()[0]['dtype']
print('output: ', output_type)

input:  <class 'numpy.uint8'>
output:  <class 'numpy.uint8'>


Now you have an integer quantized model that uses integer data for the model's input and output tensors, so it's compatible with integer-only hardware such as the [Edge TPU](https://coral.ai).

### **Technique-4: Quantization Aware Training**

In this technique, you clone the pre-trained model and fine-tune it with quantization aware training.

**Setup**

In [136]:
!pip install -q tensorflow-model-optimization

**Define the model**

You will apply quantization aware training to the whole model and see this in the model summary. All layers are now prefixed by "quant".

Note that the resulting model is quantization aware but not quantized (e.g. the weights are float32 instead of int8). The sections after show how to create a quantized model from the quantization aware one.
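
One way to picture "quantization aware but not quantized" is fake quantization: values are rounded to an 8-bit grid during the forward pass but are still stored as floats. The sketch below is a simplified illustration with a hypothetical `fake_quantize` helper; it is not how `tfmot` implements its wrappers.

```python
def fake_quantize(values, num_bits=8):
    """Round floats to a num_bits grid, but return them as floats."""
    lo, hi = min(0.0, min(values)), max(0.0, max(values))
    scale = (hi - lo) / (2 ** num_bits - 1) or 1.0
    # Snap each value to the nearest grid point, then map back to float.
    return [lo + round((v - lo) / scale) * scale for v in values]

# Hypothetical weights, chosen only for illustration.
weights = [-0.8, -0.31, 0.0, 0.27, 0.64]
simulated = fake_quantize(weights)
step = (0.64 - (-0.8)) / 255
print(simulated)
```

Training against these rounded values lets the model adapt to the precision loss before the real conversion happens.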

In the [comprehensive guide](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide.md), you can see how to quantize some layers for model accuracy improvements.

In [137]:
import tensorflow_model_optimization as tfmot

quantize_model = tfmot.quantization.keras.quantize_model

# q_aware stands for quantization aware.
q_aware_model = quantize_model(model)

# `quantize_model` requires a recompile.
q_aware_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

q_aware_model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
quantize_layer_3 (QuantizeLa (None, 28, 28)            3         
_________________________________________________________________
quant_reshape_3 (QuantizeWra (None, 28, 28, 1)         1         
_________________________________________________________________
quant_conv2d_3 (QuantizeWrap (None, 26, 26, 12)        147       
_________________________________________________________________
quant_max_pooling2d_3 (Quant (None, 13, 13, 12)        1         
_________________________________________________________________
quant_flatten_3 (QuantizeWra (None, 2028)              1         
_________________________________________________________________
quant_dense_3 (QuantizeWrapp (None, 10)                20295     
Total params: 20,448
Trainable params: 20,410
Non-trainable params: 38
_________________________________________________

**Train and evaluate the model against baseline**

To demonstrate fine-tuning, train the quantization aware model for one epoch on a 1,000-image subset of the training data.

In [138]:
train_images_subset = train_images[0:1000] # out of 60000
train_labels_subset = train_labels[0:1000]

q_aware_model.fit(train_images_subset, train_labels_subset,
                  batch_size=500, epochs=1, validation_split=0.1)



<keras.callbacks.History at 0x7f74e4eac9d0>

**Create quantized model for TFLite backend**

After conversion, you have a truly quantized model with int8 weights and uint8 activations.


In [139]:
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quant_aware_training = converter.convert()
quant_aware_training_file = tflite_models_dir/"quant_aware_training.tflite"
model_size = quant_aware_training_file.write_bytes(quant_aware_training)/1000
print("Model size after quantization: %s Kb"%model_size)



INFO:tensorflow:Assets written to: /tmp/tmpi8hakt7x/assets




Model size after quantization: 24.688 Kb


## Run the TensorFlow Lite models

Now we'll run inferences using the TensorFlow Lite [`Interpreter`](https://www.tensorflow.org/api_docs/python/tf/lite/Interpreter) to compare the model accuracies.

First, we need a function that runs inference with a given model and images, and then returns the predictions:


In [140]:
# Helper function to run inference on a TFLite model
def run_tflite_model(tflite_file, test_image_indices):
  global test_images

  # Initialize the interpreter
  interpreter = tf.lite.Interpreter(model_path=str(tflite_file))
  interpreter.allocate_tensors()

  input_details = interpreter.get_input_details()[0]
  output_details = interpreter.get_output_details()[0]

  predictions = np.zeros((len(test_image_indices),), dtype=int)
  for i, test_image_index in enumerate(test_image_indices):
    test_image = test_images[test_image_index]
    test_label = test_labels[test_image_index]

    # Check if the input type is quantized, then rescale input data to uint8
    if input_details['dtype'] == np.uint8:
      input_scale, input_zero_point = input_details["quantization"]
      test_image = test_image / input_scale + input_zero_point

    test_image = np.expand_dims(test_image, axis=0).astype(input_details["dtype"])
    interpreter.set_tensor(input_details["index"], test_image)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details["index"])[0]

    predictions[i] = output.argmax()

  return predictions
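
The input rescaling in the helper above, and the inverse mapping for outputs, can be sketched in isolation. The quantization parameters here are assumed values for an input normalized to [0, 1]; a real model reports its own through `input_details["quantization"]`. The function names are hypothetical.

```python
def quantize_input(value, scale, zero_point):
    """Float -> uint8, following `value / scale + zero_point` from the helper."""
    return max(0, min(255, round(value / scale + zero_point)))

def dequantize_output(q_value, scale, zero_point):
    """uint8 -> float, the inverse mapping."""
    return (q_value - zero_point) * scale

# Assumed parameters for illustration, not read from an actual model.
scale, zero_point = 1.0 / 255.0, 0

darkest = quantize_input(0.0, scale, zero_point)
brightest = quantize_input(1.0, scale, zero_point)
print(darkest, brightest)
print(dequantize_output(brightest, scale, zero_point))
```

The helper skips de-quantizing the output because the mapping is monotonic for a positive scale, so `argmax` picks the same class either way.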


### Evaluate the models on all images

Now let's run each of the quantized models using all the test images we loaded at the beginning of this tutorial:

In [141]:
# Helper function to evaluate a TFLite model on all images
def evaluate_model(tflite_file, model_type):
  global test_images
  global test_labels

  test_image_indices = range(test_images.shape[0])
  predictions = run_tflite_model(tflite_file, test_image_indices)

  accuracy = (np.sum(test_labels == predictions) * 100) / len(test_images)

  print('%s model accuracy is %.4f%% (Number of test samples=%d)' % (
      model_type, accuracy, len(test_images)))

**Technique-1 Evaluation**

In [142]:
evaluate_model(str(tflite_model_fp16_file), model_type="post-training_f16_quant.tflite")

post-training_f16_quant.tflite model accuracy is 98.1500% (Number of test samples=10000)


**Technique-2 Evaluation**

In [143]:
evaluate_model(str(dynamic_quant_file), model_type="post-training_dynamic_quant.tflite")

post-training_dynamic_quant.tflite model accuracy is 98.1400% (Number of test samples=10000)


**Technique-3 Evaluation**

In [144]:
evaluate_model(str(tflite_model_quant_float_file), model_type="quant_float.tflite")

quant_float.tflite model accuracy is 97.8600% (Number of test samples=10000)


In [145]:
evaluate_model(str(tflite_model_quant_int_file), model_type="quant_int.tflite")

quant_int.tflite model accuracy is 98.0300% (Number of test samples=10000)


**Technique-4 Evaluation**

In [146]:
evaluate_model(str(quant_aware_training_file), model_type="quant_aware_training.tflite")

quant_aware_training.tflite model accuracy is 98.1000% (Number of test samples=10000)


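On this small model, the accuracies of all the quantized variants fall within about 0.3 percentage points of each other (97.86% to 98.15%), so these results are too close to rank the techniques. In general, though, quantization aware training often preserves model accuracy better than post-training quantization.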
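
To put the numbers side by side, this short script tabulates the model sizes and accuracies printed earlier in this tutorial and computes each model's size reduction relative to the unquantized TensorFlow Lite model. The values are copied from the outputs above; your own run will differ slightly.

```python
# Sizes (KB) and accuracies (%) copied from the outputs printed in this tutorial.
results = {
    "float16": (44.528, 98.15),
    "dynamic range": (23.984, 98.14),
    "integer (float fallback)": (24.296, 97.86),
    "integer only": (24.352, 98.03),
    "quantization aware training": (24.688, 98.10),
}
baseline_kb = 84.576  # unquantized TensorFlow Lite model

for name, (size_kb, accuracy) in results.items():
    ratio = baseline_kb / size_kb
    print(f"{name:<28} {size_kb:>7.3f} KB  {ratio:.2f}x smaller  {accuracy:.2f}%")
```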

To learn more about other quantization strategies, read about [TensorFlow Lite model optimization](https://www.tensorflow.org/lite/performance/model_optimization).