
In [None]:
#Reference - https://medium.com/game-of-bits/optimizing-tensorflow-models-using-quantization-fb4d09b46fac
# https://ai.google.dev/edge/litert/models/post_training_quantization

* Key idea behind quantization: these techniques aim to produce smaller and faster models while keeping accuracy close to that of the original model.
* Post-training quantization: the deep learning model is trained with FP-32 tensors and later converted to INT-8 (or float-16) to get a smaller and faster model for deployment. It is easy to use and a bit more stable than quantization-aware training.
* In post-training quantization, we train the deep learning model normally and save the weights. The trained model is later converted to TFLite format and quantized.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Input, Dense, Conv2D, Flatten
from tensorflow.keras.models import Model


%matplotlib inline

In [2]:
#loading dataset
digits = load_digits()
images = digits['images']
labels = digits['target']
print (images.shape, labels.shape)

#Splitting Data
X_train, X_test, y_train, y_test = train_test_split(images, labels, test_size=0.25, random_state=42)
X_train = np.expand_dims(X_train, axis=-1)
X_test = np.expand_dims(X_test, axis=-1)
print (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

#Encoding Labels
def get_encoded_labels(target):
    # One-hot encode integer class labels (digits 0-9) into shape (n_samples, 10)
    output = np.zeros((len(target), 10))
    for ix, value in enumerate(target):
        output[ix, value] = 1
    return output

Y_train = get_encoded_labels(y_train)
Y_test = get_encoded_labels(y_test)
print (Y_train.shape, Y_test.shape)


(1797, 8, 8) (1797,)
(1347, 8, 8, 1) (450, 8, 8, 1) (1347,) (450,)
(1347, 10) (450, 10)


In [3]:


input_layer = Input(shape=(8, 8, 1))
layer = Conv2D(64, (3,3), activation='relu')(input_layer)
layer = Conv2D(32, (3,3), activation='relu')(layer)
layer = Conv2D(32, (3,3), activation='relu')(layer)
layer = Flatten()(layer)
features = Dense(32, activation='relu')(layer)
output = Dense(10, activation='softmax')(features)


model = Model(inputs=input_layer, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()



In [4]:
model.fit(X_train, Y_train, batch_size=32, epochs=10, validation_data=(X_test, Y_test))

Epoch 1/10
43/43 ━━━━━━━━━━━━━━━━━━━━ 4s 20ms/step - accuracy: 0.4809 - loss: 1.6126 - val_accuracy: 0.8956 - val_loss: 0.3401
Epoch 2/10
43/43 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.8992 - loss: 0.3236 - val_accuracy: 0.9444 - val_loss: 0.2030
Epoch 3/10
43/43 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.9302 - loss: 0.2404 - val_accuracy: 0.9578 - val_loss: 0.1474
Epoch 4/10
43/43 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.9644 - loss: 0.1030 - val_accuracy: 0.9689 - val_loss: 0.1101
Epoch 5/10
43/43 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.9842 - loss: 0.0522 - val_accuracy: 0.9689 - val_loss: 0.0964
Epoch 6/10
43/43 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.9856 - loss: 0.0501 - val_accuracy: 0.9644 - val_loss: 0.1172
Epoch 7/10
43/43 ━━━━

<keras.src.callbacks.history.History at 0x7cfc0df7ee90>

In [5]:
def get_test_accuracy(predictions, target):
    correct = 0
    for ix, pred in enumerate(predictions):
        true_value = target[ix]
        if pred[true_value] == max(pred):
            correct += 1
    return correct*100/len(target)
predictions = model.predict(X_test)
get_test_accuracy(predictions, y_test)

15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step


97.11111111111111

In [6]:
model.save("saved_model.keras")

### Post training quantization

1. Dynamic range quantization
* Dynamic range quantization reduces memory usage and speeds up computation without requiring a representative dataset for calibration.
* Because no calibration data is needed, it can be applied directly to an already-trained model, with no additional data to tune the quantization process.
* Only the weights are quantized statically, from floating point to 8-bit integers, at conversion time.
* To further reduce latency during inference, "dynamic-range" operators dynamically quantize activations to 8 bits based on their range and perform computations with 8-bit weights and activations.

Summary of dynamic range quantization:

* Precision: Typically uses 8-bit integers.

* Calibration: Does not require a representative dataset for calibration.

* Use Case: Suitable for models where calibration data is not available or practical.

* Performance: Reduces memory usage and speeds up computation without significant loss in accuracy


* Quantization Process: The model is trained with floating-point weights. At conversion time, dynamic range quantization converts these weights to 8-bit integers for storage.

* Inference: At runtime, operators with dynamic-range kernels quantize activations to 8 bits on the fly and compute with 8-bit weights and activations. For other operators, the 8-bit weights are converted back to floating point and computed with floating-point kernels. Outputs are still stored in floating point, so this is slower than fully integer computation, but it still benefits from the reduced memory usage.
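The split between static weight quantization and dynamic activation quantization can be sketched in plain NumPy. This is only an illustration of the idea under simple assumptions (one symmetric scale per tensor, a single dense layer); the helper names `quantize_symmetric` and `dynamic_range_dense` are hypothetical and this is not how the TFLite kernels are implemented.

```python
import numpy as np

def quantize_symmetric(x):
    """Map a float array to int8 using one scale taken from its max magnitude."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 4)).astype(np.float32)

# Conversion time: the weights are quantized once and stored as int8.
w_q, w_scale = quantize_symmetric(weights)

def dynamic_range_dense(x):
    # Run time: the activation scale is derived from this particular input.
    x_q, x_scale = quantize_symmetric(x)
    # Integer matmul, accumulated in int32 to avoid overflow.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    # The result is rescaled and returned as floating point.
    return acc.astype(np.float32) * (x_scale * w_scale)

x = rng.normal(size=(2, 16)).astype(np.float32)
y_float = x @ weights
y_quant = dynamic_range_dense(x)
max_abs_err = float(np.max(np.abs(y_float - y_quant)))
print(f"max abs difference vs float32 matmul: {max_abs_err:.4f}")
```

The stored weights take a quarter of the float32 memory, and the output differs from the float32 result only by a small rounding error.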

In [7]:
## converting to tflite model
converter = tf.lite.TFLiteConverter.from_keras_model(model)

## applying quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

Saved artifact at '/tmp/tmpu1lz9uma'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 8, 8, 1), dtype=tf.float32, name='keras_tensor')
Output Type:
  TensorSpec(shape=(None, 10), dtype=tf.float32, name=None)
Captures:
  137421982223520: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982225984: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982226688: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982228976: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982229504: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982231792: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982232848: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982235136: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982235664: TensorSpec(shape=(), dtype=tf.resource, name=None)
  137421982234080: TensorSpec(shape=(), dtype=tf.resource, name=None)


In [8]:
with open('quantized_model.tflite', 'wb') as f:
    f.write(tflite_quant_model)

Inference with the quantized model

In [9]:

interpreter = tf.lite.Interpreter(model_path="quantized_model.tflite")

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.allocate_tensors()

print(input_details)
print(output_details)

[{'name': 'serving_default_keras_tensor:0', 'index': 0, 'shape': array([1, 8, 8, 1], dtype=int32), 'shape_signature': array([-1,  8,  8,  1], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}]
[{'name': 'StatefulPartitionedCall_1:0', 'index': 18, 'shape': array([ 1, 10], dtype=int32), 'shape_signature': array([-1, 10], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}]


In [10]:
input_details[0]

{'name': 'serving_default_keras_tensor:0',
 'index': 0,
 'shape': array([1, 8, 8, 1], dtype=int32),
 'shape_signature': array([-1,  8,  8,  1], dtype=int32),
 'dtype': numpy.float32,
 'quantization': (0.0, 0),
 'quantization_parameters': {'scales': array([], dtype=float32),
  'zero_points': array([], dtype=int32),
  'quantized_dimension': 0},
 'sparsity_parameters': {}}

In [11]:
# interpreter.set_tensor() assigns the input data (the image) to the model's input tensor.
# interpreter.invoke() then runs inference and computes the outputs.
# interpreter.get_tensor() retrieves the output tensor once inference is complete.


predictions = []
for img in X_test:
    interpreter.set_tensor(input_details[0]['index'], [img.astype('float32')])
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    predictions.append(output_data[0])

predictions = np.array(predictions)
get_test_accuracy(predictions, y_test)

97.11111111111111

Comparing model size

In [12]:
def get_model_size(model_path):
  model_size = os.path.getsize(model_path)
  model_size_mb = model_size / (1024*1024)
  print(f"model size: {model_size_mb:.2f} MB")

In [13]:
model_path = "/content/saved_model.keras"
get_model_size(model_path)

model size: 0.41 MB


In [14]:
model_path = "/content/quantized_model.tflite"
get_model_size(model_path)

model size: 0.04 MB


2. Float-16 quantization
* Float-16 quantization reduces model size by converting the model weights from FP-32 to FP-16. This roughly halves the size of the weights with minimal accuracy loss.

* Precision: Uses 16-bit floating-point numbers.

* Calibration: Does not require a representative dataset for calibration.

* Use Case: Best for models that will run on hardware with support for 16-bit floating-point operations, such as GPUs.

* Performance: Offers a balance between memory usage and computational speed, with less precision loss compared to 8-bit quantization.

* Quantization Process: Weights are converted from 32-bit floating-point to 16-bit floating-point format.

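* Inference: On hardware with native float-16 support, such as GPUs (for example through the GPU delegate), the model can compute directly on the 16-bit values. On CPU, the weights are by default converted back to 32-bit floating point before computation. This balances reduced memory usage against precision, especially on hardware optimized for floating-point operations.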
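The size and precision trade-off described above can be checked with a small NumPy sketch. The random weights are made up for illustration, and the upcast to float32 only mimics what a runtime without native float-16 arithmetic would have to do.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weights in [-1, 1), stored as float32 after training.
weights_fp32 = rng.uniform(-1.0, 1.0, size=1000).astype(np.float32)

# Conversion time: store the weights at half precision.
weights_fp16 = weights_fp32.astype(np.float16)

# Upcast again before computing, as a float32-only runtime would.
restored = weights_fp16.astype(np.float32)

size_ratio = weights_fp16.nbytes / weights_fp32.nbytes
max_abs_err = float(np.max(np.abs(weights_fp32 - restored)))
print(f"size ratio: {size_ratio}, max rounding error: {max_abs_err:.2e}")
```

The stored array is exactly half the size, and for values of this magnitude the rounding error stays far below what 8-bit quantization would introduce.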

In [17]:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model2 = converter.convert()

with open('quantized_model2.tflite', 'wb') as f:
    f.write(tflite_quant_model2)



In [18]:
model_path = "/content/quantized_model2.tflite"
get_model_size(model_path)

model size: 0.07 MB


3. Full integer quantization



* Precision: Uses 8-bit integers for both weights and activations.
* Calibration: Requires a representative dataset to determine the scale and zero-point values.

* Use Case: Ideal for models where you have access to calibration data and want to maximize performance on integer-only hardware.

* Performance: Further reduces memory usage and improves latency compared to dynamic range quantization.

* Quantization Process: Both weights and activations are converted to 8-bit integers.

* Inference: During inference, the model uses these 8-bit integers directly for computation. This avoids the overhead of converting back to floating-point values, making inference faster on hardware that supports integer operations.
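To make the role of the representative dataset concrete, here is a minimal NumPy sketch of min/max calibration for an affine int8 mapping. The helpers `calibrate`, `quantize` and `dequantize` are hypothetical names, the sample data is random, and the real converter may choose ranges differently; the sketch only shows how a scale and zero point can be derived from observed values.

```python
import numpy as np

QMIN, QMAX = -128, 127

def calibrate(samples):
    """Derive scale and zero point from the value range seen in the samples."""
    lo = min(0.0, float(np.min(samples)))  # keep 0.0 inside the range
    hi = max(0.0, float(np.max(samples)))
    scale = (hi - lo) / (QMAX - QMIN)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(QMIN - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    q = np.round(x / scale) + zero_point
    return np.clip(q, QMIN, QMAX).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
# Stand-in for representative inputs: pixel intensities in [0, 16).
samples = rng.uniform(0.0, 16.0, size=(100, 8, 8))

scale, zero_point = calibrate(samples)
batch = samples[:10]
q = quantize(batch, scale, zero_point)
max_abs_err = float(np.max(np.abs(batch - dequantize(q, scale, zero_point))))
print(f"scale={scale:.4f}, zero_point={zero_point}, max error={max_abs_err:.4f}")
```

Values inside the calibrated range round-trip with an error of at most half a quantization step, which is why the calibration samples should cover the range of inputs expected at inference time.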

In [22]:

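# Calibrate on a small sample of training images, one image per step.
# The LiteRT guide suggests roughly 100-500 representative samples.
num_calibration_steps = 100
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# No supported_ops restriction is set, so ops without an integer kernel fall
# back to float and the model's input and output tensors stay float32.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset_gen():
    for i in range(num_calibration_steps):
        # Yield a list with one float32 array per model input, shape (1, 8, 8, 1)
        yield [X_train[i:i + 1].astype('float32')]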

converter.representative_dataset = representative_dataset_gen
tflite_quant_model3 = converter.convert()

with open('quantized_model3.tflite', 'wb') as f:
    f.write(tflite_quant_model3)





In [23]:
model_path = "/content/quantized_model3.tflite"
get_model_size(model_path)

model size: 0.04 MB
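As a rough sanity check on the reported file sizes, the parameter count of the CNN defined earlier can be turned into a size estimate per storage width. This is a back-of-the-envelope sketch: it assumes every parameter is stored at the stated width and ignores file-format overhead, and the helper names are hypothetical.

```python
def conv2d_params(kernel, c_in, c_out):
    # Kernel weights plus one bias per output channel.
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

# An 8x8 input shrinks 8 -> 6 -> 4 -> 2 through three unpadded 3x3 convolutions,
# so Flatten produces 2 * 2 * 32 features.
flat_features = 2 * 2 * 32

total_params = (
    conv2d_params(3, 1, 64)
    + conv2d_params(3, 64, 32)
    + conv2d_params(3, 32, 32)
    + dense_params(flat_features, 32)
    + dense_params(32, 10)
)

def estimate_mb(bytes_per_param):
    return total_params * bytes_per_param / (1024 * 1024)

print(f"parameters: {total_params}")
for name, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {estimate_mb(width):.3f} MB")
```

The estimates (about 0.125 MB, 0.063 MB and 0.031 MB) sit just below the 0.07 MB and 0.04 MB measured for the float-16 and 8-bit files, with the gap plausibly explained by file-format overhead. The 0.41 MB `.keras` file is larger than the float32 estimate, likely because it also holds optimizer state and metadata; that explanation is an assumption, not something measured here.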
