<div>
  <img src="images/0.jpg" alt="Pruning Trees" style="width:100%;">
    <h1>Magnitude-Based Weight Pruning with Keras</h1>
      <br/>
      <p>
          Deep learning models now require a significant amount of compute, memory, and power to be developed, deployed, 
          and used. This is an impediment when real-time inference is a priority, or when models run on edge devices 
          and browsers with limited computational resources. A major concern for current deep 
          learning models is <b>energy efficiency</b>. One way to address it is to make inference more efficient, 
          which can be done through <b><i>Magnitude-Based Weight Pruning</i></b>.
      </p>
  </div>
</div>
<p>
    <br/>
    <b>What is weight pruning?</b>
    <br/>
    <br/>
    Magnitude-based weight pruning gradually zeroes out model weights during the training process to achieve model sparsity. 
    Sparse models are easier to compress, and the zeroes can be skipped during inference to improve latency.
    <br/>
    <br/>
    This technique brings improvements via model compression.
    <br/>
    <br/>
    The technique is being evaluated in various speech applications, such as speech recognition and text-to-speech, and has 
    been tried on a range of vision and translation models.
    <br/>
    <br/>
    By eliminating unnecessary values in the weight tensor, that is, by setting neural network parameters to zero, pruning 
    removes low-weight connections between the layers of a neural network.
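
The core idea can be sketched in a few lines of plain Python. This is only an illustration of one-shot magnitude pruning on a flat list of weights, not the Keras API used later in this notebook; the function name `prune_by_magnitude` and the toy weights are made up for the example.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0.0."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(round(sparsity * len(weights)))
    # Indices ordered from the smallest to the largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights, chosen only for illustration.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.002, 0.1, -0.9]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, 0.0, -0.9]
```

The four weights closest to zero are removed, while the large positive and negative weights survive unchanged.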
    <br/>
    <br/>
    <b>Why is it useful?</b>
    <br/>
    <br/>
    Tensors in which many values are set to zero can be considered sparse. This brings important benefits:
    <ul>
    <li><b>Compression</b>: Sparse tensors are amenable to compression by only keeping the non-zero values and their 
        corresponding coordinates.</li>
    <li><b>Speed</b>: Sparse tensors allow otherwise unnecessary computations involving the zero values to be skipped.</li>
    <li><b>Performance</b>: By discarding connections that contribute little, the neural network can sometimes generalize 
        better, leading to a performance gain.</li>
    </ul>
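
To make the compression benefit concrete, here is a minimal sketch of a coordinate-style sparse encoding in plain Python. The helper names `to_coo` and `from_coo` are hypothetical and the matrix is toy data; real sparse formats are more elaborate.

```python
def to_coo(matrix):
    """Encode a 2-D list as (shape, coordinates, values), keeping non-zeros only."""
    shape = (len(matrix), len(matrix[0]) if matrix else 0)
    coords, values = [], []
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:
                coords.append((r, c))
                values.append(value)
    return shape, coords, values


def from_coo(shape, coords, values):
    """Rebuild the dense 2-D list from its coordinate encoding."""
    rows, cols = shape
    matrix = [[0.0] * cols for _ in range(rows)]
    for (r, c), value in zip(coords, values):
        matrix[r][c] = value
    return matrix


# Toy 3x4 matrix with 3 non-zero entries out of 12.
dense = [
    [0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, -1.2],
    [0.4, 0.0, 0.0, 0.0],
]
shape, coords, values = to_coo(dense)
print(coords)  # [(0, 1), (1, 3), (2, 0)]
print(values)  # [0.7, -1.2, 0.4]
restored = from_coo(shape, coords, values)
```

Because each stored value also carries its coordinates, this kind of encoding only pays off when the tensor is sparse enough.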
    <br/>
    <b>How does it work?</b>
    <br/>
    <br/>
    The <b><i>Keras-Based Weight Pruning API</i></b> is designed to iteratively remove connections based on their magnitude, 
    during training.
    <br/>
    <br/>
    This notebook walks through an end-to-end example of using the Keras-Based Weight Pruning API on a simple MNIST model. It will
    demonstrate that a generic file compression algorithm (e.g. zip) reduces the size of the pruned Keras model, and 
    that this size reduction persists when the model is converted to TensorFlow Lite format for devices with limited computational 
    resources.
    <br/>
    <br/>
    Two things worth clarifying:
    <ul>
        <li>The <b><i>Magnitude-Based Weight Pruning</i></b> technique and the <b><i>Keras-Based Weight Pruning API</i></b> are not 
            TensorFlow Lite specific. This notebook just demonstrates their application on the TensorFlow Lite backend, as it 
            covers size-sensitive use cases.</li>
    <li>A sparse model will not be faster to execute <b><i>by itself!</i></b> It just enables backends with such a capability. <i>According to the official TensorFlow documentation, however, TensorFlow Lite will take advantage of the sparsity to speed up computations in the near future</i>.</li>
    </ul>
    <br/>
    This notebook is divided into the following sections:
    <ul>
        <li>Training a MNIST model with Keras from scratch.</li>
        <li>Training a pruned MNIST with the Pruning API.</li>
        <li>Comparing the size of the pruned model and the non-pruned one after compression.</li>
        <li>Converting the pruned model to TensorFlow Lite format and verifying that accuracy persists.</li>
        <li>Demonstrating how the pruned model works with other optimization techniques, like <a href=''>post-training quantization</a>.</li>
    </ul>
    <h2>Set Up</h2>
    <br/>
    To use the Keras-Based Weight Pruning API, the <code>tensorflow-model-optimization</code> package must be installed. 
    Check the <a href='https://github.com/tensorflow/model-optimization'>TensorFlow model optimization repo</a> for API 
    version compatibility.
    <br/>
    <br/>
    Several models are trained in this notebook, so installing the <code>tensorflow-gpu</code> package is recommended, 
    though not required, to speed things up.
</p>

In [2]:
!pip uninstall -y tensorflow-gpu
!pip uninstall -y tf-nightly-gpu
!pip install -q -U tensorflow-gpu

!pip install -q tensorflow-model-optimization

Found existing installation: tensorflow-gpu 2.1.0
Uninstalling tensorflow-gpu-2.1.0:
  Successfully uninstalled tensorflow-gpu-2.1.0




<h3>Import Necessary Packages, Modules & Libraries</h3>

In [1]:
%load_ext tensorboard
import tensorboard

<h4>Brief Introduction to Tensorboard</h4>
<br/>
In machine learning, to improve something you often need to be able to measure it.
<code>tensorboard</code>: TensorBoard is a tool for providing the measurements and visualizations needed during the machine learning workflow. It enables tracking experiment metrics like loss and accuracy, visualizing the model graph, projecting embeddings to a lower dimensional space, and much more.
<br/>
<br/>
If an alert saying <code>The tensorboard module is not an IPython extension.</code> appears when loading <code>tensorboard</code>, it can be resolved by following <a href='https://www.dlology.com/blog/how-to-run-tensorboard-in-jupyter-notebook/'>this guide</a>.

With TensorFlow 2.x, eager execution is enabled by default as soon as TensorFlow is imported.
<br/>
<br/>
<b>What is Eager Execution?</b>
<br/>
<br/>
According to <a href='https://www.tensorflow.org/guide/eager'>Tensorflow Guide</a> this is Eager Execution:
<br/>
<br/>
<div style='background-color:#f1fbff;padding:2%;'>
    TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, without 
    building graphs: operations return concrete values instead of constructing a computational graph to run later. This makes 
    it easy to get started with TensorFlow and debug models, and it reduces boilerplate as well. To follow along with this 
    guide, run the code samples below in an interactive python interpreter.
    <br/>
    <br/>
    Eager execution is a flexible machine learning platform for research and experimentation, providing:
    <ul>
        <li><i>An intuitive interface</i>—Structure your code naturally and use Python data structures. Quickly iterate on
            small models and small data.</li>
        <li><i>Easier debugging</i>—Call ops directly to inspect running models and test changes. Use standard Python debugging 
            tools for immediate error reporting.</li>
        <li><i>Natural control flow</i>—Use Python control flow instead of graph control flow, simplifying the specification of 
            dynamic models.</li>
    </ul>
    Eager execution supports most TensorFlow operations and GPU acceleration.
</div>

In [2]:
import tensorflow as tf
import tempfile
import zipfile
import os

The <code>tempfile</code> module makes it possible to generate temporary files and directories.
<br/>
The <code>zipfile</code> module provides tools to create, read, write, append to, and list a ZIP file.
<br/>
The <code>os</code> module provides a portable way of using operating system dependent functionality.
<br/>
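
As a small, self-contained sketch of how these three modules fit together (the file contents are made up for illustration): `tempfile.mkstemp()` returns both an open file descriptor and a path, so the descriptor is closed explicitly here, and the temporary files are removed at the end because `mkstemp()` does not delete them automatically.

```python
import os
import tempfile
import zipfile

# mkstemp returns an open OS-level file descriptor and the path of the file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)  # close the descriptor; the file itself stays on disk

with open(path, "w") as f:
    f.write("0" * 10_000)  # highly repetitive content compresses well

fd, zip_path = tempfile.mkstemp(suffix=".zip")
os.close(fd)
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write(path, arcname=os.path.basename(path))

original_size = os.path.getsize(path)
zipped_size = os.path.getsize(zip_path)
print("original:", original_size, "bytes, zipped:", zipped_size, "bytes")

# Clean up the temporary files created above.
os.remove(path)
os.remove(zip_path)
```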
<h3>Preparing Training Data</h3>

In [3]:
BATCH_SIZE = 128
NUM_CLASSES = 10
EPOCHS = 10

# Input Image Dimensions
IMG_ROWS, IMG_COLS = 28, 28

In [4]:
# The data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

if tf.keras.backend.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, IMG_ROWS, IMG_COLS)
    x_test = x_test.reshape(x_test.shape[0], 1, IMG_ROWS, IMG_COLS)
    INPUT_SHAPE = (1, IMG_ROWS, IMG_COLS)
else:
    x_train = x_train.reshape(x_train.shape[0], IMG_ROWS, IMG_COLS, 1)
    x_test = x_test.reshape(x_test.shape[0], IMG_ROWS, IMG_COLS, 1)
    INPUT_SHAPE = (IMG_ROWS, IMG_COLS, 1)

<h4>Data Preprocessing</h4>
<ul>
    <li>Convert features into <code>float32</code> data type.</li>
    <li>Divide features by 255 to scale them to the [0, 1] range.
        <ul>
            <li>Show train tensor shape.</li>
            <li>Show the number of train and test samples.</li>
        </ul>
    </li>
    <li>Convert class vectors (labels) to binary class matrices (one-hot encoding).</li>
</ul>

In [5]:
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


In [6]:
# Convert class vectors to binary class matrices
y_train = tf.keras.utils.to_categorical(y_train, NUM_CLASSES)
y_test = tf.keras.utils.to_categorical(y_test, NUM_CLASSES)
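
To show what a binary class matrix looks like, here is a plain-Python sketch of one-hot encoding. The helper `one_hot` is a hypothetical stand-in written for this example, not the Keras implementation, and the three labels are illustrative.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label index."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded


labels = [5, 0, 4]  # illustrative digit labels
encoded = one_hot(labels, num_classes=10)
for label, row in zip(labels, encoded):
    print(label, row)

# The original label is the position of the 1.0 in each row.
decoded = [row.index(1.0) for row in encoded]
```

Each row has as many columns as there are classes, which is the shape that the `categorical_crossentropy` loss used later expects.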

<h2>Training a MNIST model without pruning</h2>
<h4>Building a Convolutional Neural Network from scratch: this is the MNIST model.</h4>

In [7]:
layer = tf.keras.layers

model = tf.keras.Sequential([
    layer.Conv2D(32, 5, padding='same', activation='relu', input_shape=INPUT_SHAPE),
    layer.MaxPooling2D((2, 2), (2, 2), padding='same'),
    layer.BatchNormalization(),
    layer.Conv2D(64, 5, padding='same', activation='relu'),
    layer.MaxPooling2D((2, 2), (2, 2), padding='same'),
    layer.Flatten(),
    layer.Dense(1024, activation='relu'),
    layer.Dropout(0.4),
    layer.Dense(NUM_CLASSES, activation='softmax')
])

Let's take a look at the architecture of the convolutional neural network MNIST model just built.

In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 32)        832       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 32)        0         
_________________________________________________________________
batch_normalization (BatchNo (None, 14, 14, 32)        128       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 64)        51264     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 3136)              0         
_________________________________________________________________
dense (Dense)                (None, 1024)              3
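
The parameter counts and output shapes visible in the summary can be reproduced by hand. The sketch below uses the standard formulas for a convolution with bias, batch normalization, and `'same'`-padded pooling; the helper names are made up for this example.

```python
import math


def conv2d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


def batch_norm_params(channels):
    """gamma, beta, moving mean and moving variance for every channel."""
    return 4 * channels


def same_pool_output(size, stride):
    """Spatial output size of a pooling layer with padding='same'."""
    return math.ceil(size / stride)


conv1 = conv2d_params(kernel_size=5, in_channels=1, filters=32)
bn = batch_norm_params(32)
conv2 = conv2d_params(kernel_size=5, in_channels=32, filters=64)
side = same_pool_output(same_pool_output(28, 2), 2)
flat = side * side * 64

print(conv1, bn, conv2)  # 832 128 51264
print(side, flat)        # 7 3136
```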

<h4>Load TensorBoard to monitor the training process</h4>

In [9]:
logdir = tempfile.mkdtemp()
print('Writing training logs to ' + logdir)

Writing training logs to C:\Users\kleye\AppData\Local\Temp\tmpp7_wbg31


In [None]:
%tensorboard --logdir={logdir}

<h4>Training the MNIST model</h4>

In [11]:
callbacks = [tf.keras.callbacks.TensorBoard(log_dir=logdir, profile_batch=0)]

In [12]:
model.compile(
    loss=tf.keras.losses.categorical_crossentropy,
    optimizer='adam',
    metrics=['accuracy'])

In [13]:
model.fit(x_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          verbose=1,
          callbacks=callbacks,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [14]:
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.02898943738769276
Test accuracy: 0.9907


<h4>Save the original model for size comparison later</h4>

In [66]:
# Backend agnostic way to save/restore models
_, keras_file = tempfile.mkstemp('mnist_model.h5')
print('Saving model to: ', keras_file)
tf.keras.models.save_model(model, keras_file, include_optimizer=False)

Saving model to:  C:\Users\kleye\AppData\Local\Temp\tmpnlknlmv5mnist_model.h5


<h2>Training a Pruned MNIST Model with Pruning API</h2>
<br/>
<p>
    The <code>prune_low_magnitude()</code> API provides the ability to train models with removed connections. The Keras-based API can be applied at the level of individual layers or the entire model. In this notebook both options will be applied and explained.
    <br/>
    <br/>
    At a high level, given a schedule and a target sparsity, the technique works by iteratively removing (i.e. zeroing out) connections between layers.
    <br/>
    <br/>
    For example, a typical configuration might target 75% sparsity by pruning connections (aka synapses) every 100 steps, starting from step 2,000. Note that a step is one batch update, not an epoch.
    <br/>
    <img src="images/2.png" style="width:60%;height:600%"/>
    <h3>Building a Pruned Model Layer by Layer</h3>
    <br/>
    Building this pruned model <b><i>shows how to use the API at the level of layers to build a pruned MNIST solver 
    model.</i></b>
    <br/>
    <br/>
    In this example, <code>prune_low_magnitude()</code> <b>receives as a parameter the Keras layer whose weights are 
    to be pruned</b>.
    <br/>
    <br/>
    The <code>prune_low_magnitude()</code> function requires pruning parameters, which configure the pruning algorithm during training. For detailed documentation of these parameters, <a href='https://github.com/tensorflow/model-optimization'>the official GitHub repository</a> offers a more in-depth explanation and guide. The parameters used here mean:
    <ul>
        <li><b>Sparsity</b>: <code>PolynomialDecay</code> is used throughout the training process. Pruning starts 
            at a sparsity level of 50% and the model is gradually trained to reach 90% sparsity. <b>X% sparsity means that 
            X% of the weight tensor is going to be pruned away</b>. In the beginning the sparsity grows rapidly from 
            <code>initial_sparsity</code>, but then levels off as it approaches the target sparsity, which is the 
            <code>final_sparsity</code>. The function applied is:
            <br/>
            <br/>
            <div style="text-align:center;">
                $s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n \Delta t}\right)^3 \quad \text{for } t \in \{t_0,\ t_0 + \Delta t,\ \ldots,\ t_0 + n \Delta t\}$
            </div>
            Meaning of the symbols:
            <ul style="list-style-type:none;">
                <li>$s_t$: Target sparsity level at step $t$.</li>
                <li>$s_i$: Initial sparsity value.</li>
                <li>$s_f$: Final sparsity value.</li>
                <li>$t$:   Current training step.</li>
                <li>$t_0$: Initial training step.</li>
                <li>$n$:   Number of pruning steps.</li>
                <li>$\Delta t$: Pruning frequency, <b><i>Schedule</i></b>.</li>
            </ul>
            </ul>
            <br/>
        </li>
        <li><b>Schedule</b>: Connections are pruned every 100 steps, from step 2000 until the end of training. Pruning 
            starts late because the model should first train without pruning for a few epochs to reach a 
            certain accuracy, which aids convergence. Furthermore, the model is given some time to recover after each pruning step, 
            therefore <b><i>pruning does not happen on every step</i></b>. The pruning frequency (aka <i>schedule</i> or 
            $\Delta t$) is set to 100.</li>
    </ul>
</p>
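
The sparsity formula above can be evaluated directly. The following is a minimal sketch in plain Python, not the library's implementation: it assumes that the span $n \Delta t$ equals `end_step - begin_step`, holds the value constant outside the pruning window, and ignores the pruning frequency. The function name and the hard-coded end step of 5628 (the value used later in this notebook) are assumptions made so that the example runs on its own.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3):
    """Target sparsity at `step`, following s_f + (s_i - s_f) * (1 - progress)^power."""
    # Fraction of the pruning window that has elapsed, clipped to [0, 1]
    # (clipping is a choice made for this sketch).
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


for step in (2000, 2500, 3000, 4000, 5000, 5628):
    target = polynomial_decay_sparsity(step, 0.50, 0.90, begin_step=2000, end_step=5628)
    print(f"step {step}: target sparsity {target:.3f}")
```

Printing the values shows the shape described above: the target sparsity rises quickly right after step 2000 and flattens as it approaches 90%.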

In [18]:
from tensorflow_model_optimization.sparsity import keras as sparsity

In the cells below, training of the pruned Keras model begins. The model is trained for its first 10 epochs, saved to disk, and finally restored to continue training for 2 more epochs.
<br/>
Gradual sparsity is controlled by four important parameters:
<ul>
    <li><code>initial_sparsity</code>.</li>
    <li><code>final_sparsity</code>.</li>
    <li><code>begin_step</code>.</li>
    <li><code>end_step</code>.</li>
</ul>
The <code>end_step</code> parameter is calculated from the number of training examples, the batch size, and the total number of epochs to train.

In [19]:
import numpy as np
# Epochs for Pruning Model
EPOCHS_PM = 12
NUM_TRAIN_SAMPLES = x_train.shape[0]
END_STEP = np.ceil(1.0 * NUM_TRAIN_SAMPLES / BATCH_SIZE).astype(np.int32) * EPOCHS_PM
print('End step value: ' + str(END_STEP))

End step value: 5628
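
The relationship between steps and epochs behind this value can be worked out by hand. The sketch below repeats the constants from this notebook so that it runs on its own; the list of pruning steps assumes that updates happen at `begin_step` and then every `frequency` steps up to `end_step`.

```python
import math

NUM_TRAIN_SAMPLES = 60000
BATCH_SIZE = 128
EPOCHS_PM = 12
BEGIN_STEP = 2000
FREQUENCY = 100

# One step is one batch, so an epoch takes ceil(samples / batch size) steps.
steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / BATCH_SIZE)
end_step = steps_per_epoch * EPOCHS_PM
begin_epoch = BEGIN_STEP / steps_per_epoch
steps_after_10_epochs = 10 * steps_per_epoch

# Assumed update points: begin_step, then every FREQUENCY steps up to end_step.
pruning_steps = list(range(BEGIN_STEP, end_step + 1, FREQUENCY))

print("steps per epoch:", steps_per_epoch)               # 469
print("end step:", end_step)                             # 5628
print("pruning begins after %.2f epochs" % begin_epoch)
print("steps completed after 10 epochs:", steps_after_10_epochs)  # 4690
print("number of pruning updates:", len(pruning_steps))  # 37
```

Note that 10 epochs cover only 4690 of the 5628 steps, which is why two more epochs are needed to reach the end of the pruning schedule.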


Then the dictionary of keyword arguments (<b>**kwargs</b>) <code>pruning_params</code> is created. It holds the key <code>pruning_schedule</code>, whose value is an instance of the <code>PolynomialDecay</code> class from the <code>sparsity</code> module.

In [20]:
pruning_params = {
      'pruning_schedule': sparsity.PolynomialDecay(initial_sparsity=0.50,
                                                   final_sparsity=0.90,
                                                   begin_step=2000,
                                                   end_step=END_STEP,
                                                   frequency=100)
}

The sequential <code>pruned_model</code> model is defined, and the <code>prune_low_magnitude()</code> function receives as a parameter each Keras layer whose weights are to be pruned.

In [23]:
pruned_model = tf.keras.Sequential([
    sparsity.prune_low_magnitude(layer.Conv2D(32, 5, padding='same', activation='relu'), input_shape=INPUT_SHAPE, **pruning_params),
    layer.MaxPooling2D((2, 2), (2, 2), padding='same'),
    layer.BatchNormalization(),
    sparsity.prune_low_magnitude(layer.Conv2D(64, 5, padding='same', activation='relu'), **pruning_params),
    layer.MaxPooling2D((2, 2), (2, 2), padding='same'),
    layer.Flatten(),
    sparsity.prune_low_magnitude(layer.Dense(1024, activation='relu'), **pruning_params),
    layer.Dropout(0.4),
    sparsity.prune_low_magnitude(layer.Dense(NUM_CLASSES, activation='softmax'), **pruning_params)
])

Instructions for updating:
Please use `layer.add_weight` method instead.


The simple architecture of the model is shown.

In [24]:
pruned_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
prune_low_magnitude_conv2d_4 (None, 28, 28, 32)        1634      
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 14, 14, 32)        128       
_________________________________________________________________
prune_low_magnitude_conv2d_5 (None, 14, 14, 64)        102466    
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 3136)              0         
_________________________________________________________________
prune_low_magnitude_dense_3  (None, 1024)             
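
The wrapped convolution layers report roughly twice as many parameters as their unwrapped counterparts (1634 vs. 832 and 102466 vs. 51264). One reading that is consistent with these two numbers is that the wrapper stores one mask entry per kernel weight plus two scalars. In the sketch below these scalars are assumed to be a threshold and a step counter; this is an assumption checked only against the counts shown in the summary, not a statement about the library internals.

```python
def wrapped_layer_params(kernel_params, bias_params):
    """Parameter count of a pruning-wrapped layer under the assumption above."""
    mask = kernel_params  # assumed: one mask entry per kernel weight
    threshold = 1         # assumed: scalar magnitude threshold
    pruning_step = 1      # assumed: scalar step counter
    return kernel_params + bias_params + mask + threshold + pruning_step


wrapped_conv1 = wrapped_layer_params(kernel_params=5 * 5 * 1 * 32, bias_params=32)
wrapped_conv2 = wrapped_layer_params(kernel_params=5 * 5 * 32 * 64, bias_params=64)
print(wrapped_conv1, wrapped_conv2)  # 1634 102466
```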

<h4>Load Tensorboard</h4>

In [25]:
logdir = tempfile.mkdtemp()
print('Writing training logs to ' + logdir)

Writing training logs to C:\Users\kleye\AppData\Local\Temp\tmptq2j1h2s


In [None]:
%tensorboard --logdir={logdir}

<h4>Training the Pruned MNIST Model</h4>
<br/>
Pruning starts at step 2000, a little over four epochs into training, <b>once the unpruned model has had time to reach a high accuracy (typically above 98% on MNIST)</b>.
<br/>
<br/>
The <code>pruned_model</code> is compiled with:
<ul>
    <li><code>categorical_crossentropy</code> as loss function.</li>
    <li><code>adam</code> as the optimizer.</li>
    <li><code>accuracy</code> as the metric for evaluation.</li>
</ul>

In [28]:
pruned_model.compile(
    loss=tf.keras.losses.categorical_crossentropy,
    optimizer='adam',
    metrics=['accuracy'])

Through the <code>callbacks</code> parameter, a pruning step callback is added to peg the pruning step to the optimizer's
step, along with a callback that adds pruning summaries to <code>TensorBoard</code>.

In [29]:
callbacks = [
    sparsity.UpdatePruningStep(),
    sparsity.PruningSummaries(log_dir=logdir, profile_batch=0)
]

<b>The model is ready to be trained and pruned!</b>

In [30]:
pruned_model.fit(x_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=10,
          verbose=1,
          callbacks=callbacks,
          validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_5/mask:0/sparsity is illegal; using prune_low_magnitude_conv2d_5/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_3/mask:0/sparsity is illegal; using prune_low_magnitude_dense_3/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_4/mask:0/sparsity is illegal; using prune_low_magnitude_dense_4/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_4/threshold:0/threshold is illegal; using prune_low_magnitude_conv2d_4/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_5/threshold:0/threshold is illegal; using prune_low_magnitude_conv2d_5/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_3/threshold:0/threshold is illegal; using prune_low_magnitude_dense_3/threshold_0/threshold instead.
INFO:tensorflow:Summary name prun

INFO:tensorflow:Summary name prune_low_magnitude_dense_3/threshold:0/threshold is illegal; using prune_low_magnitude_dense_3/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_4/threshold:0/threshold is illegal; using prune_low_magnitude_dense_4/threshold_0/threshold instead.
Epoch 7/10
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_5/mask:0/sparsity is illegal; using prune_low_magnitude_conv2d_5/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_3/mask:0/sparsity is illegal; using prune_low_magnitude_dense_3/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_4/mask:0/sparsity is illegal; using prune_low_magnitude_dense_4/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_4/threshold:0/threshold is illegal; using prune_low_magnitude_conv2d_4/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_5/threshold:0/threshold is il

<tensorflow.python.keras.callbacks.History at 0x2574fe3af60>

Time to check the results of the <code>pruned_model</code>.
<h4>Is pruning working?</h4>

In [31]:
score = pruned_model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.021509590928414035
Test accuracy: 0.9931


Pruning has not finished yet: 2 more epochs remain to complete the pruning schedule. Even so, the results so far are slightly better than those obtained from the regular model. The table below compares the losses and accuracies of both models.

<br/>
<table style="border: 1px solid #ddd;text-align: left;border-collapse: collapse;width: 50%;">
  <tr>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Model Type</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Loss</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Accuracy</th>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Regular MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0289</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9907</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Pruned MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0215</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9931</td>
  </tr>
</table>
<br/>
<h4>Save and Restore the Pruned MNIST Model</h4>
<br/>
In this part of the notebook, the training will be finished by completing the two remaining epochs, but first a checkpoint file must be prepared for saving the <code>pruned_model</code>.

In [54]:
_, checkpoint_pruned_model_file = tempfile.mkstemp('pruned_mnist_model.h5')
print('Saving pruned model to: ', checkpoint_pruned_model_file)

Saving pruned model to:  C:\Users\kleye\AppData\Local\Temp\tmpyy1fnkcnpruned_mnist_model.h5


<code>save_model()</code> sets <code>include_optimizer</code> to <code>True</code> by default. It is kept that way here because preserving the state of the optimizer across training sessions is <b>required</b> for pruning to work properly.

In [55]:
tf.keras.models.save_model(pruned_model, checkpoint_pruned_model_file, include_optimizer=True)

The function <code>prune_scope()</code> makes it possible to load a model that has been pruned before. A pruned model has to be loaded inside the <code>prune_scope()</code> context so that it can be deserialized.

In [56]:
with sparsity.prune_scope():
    restored_pruned_mnist_model = tf.keras.models.load_model(checkpoint_pruned_model_file)

Set the <code>callbacks</code>. In this case they are set up just as before, but different callbacks can be integrated if desired.

In [38]:
callbacks = [
    sparsity.UpdatePruningStep(),
    sparsity.PruningSummaries(log_dir=logdir, profile_batch=0)
]

<b>Fit the restored pruned model</b>

In [39]:
restored_pruned_mnist_model.fit(x_train, y_train,
                   batch_size=BATCH_SIZE,
                   epochs=2,
                   verbose=1,
                   callbacks=callbacks,
                   validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/2
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_5_2/mask:0/sparsity is illegal; using prune_low_magnitude_conv2d_5_2/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_3_2/mask:0/sparsity is illegal; using prune_low_magnitude_dense_3_2/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_4_2/mask:0/sparsity is illegal; using prune_low_magnitude_dense_4_2/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_4_2/threshold:0/threshold is illegal; using prune_low_magnitude_conv2d_4_2/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_5_2/threshold:0/threshold is illegal; using prune_low_magnitude_conv2d_5_2/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_3_2/threshold:0/threshold is illegal; using prune_low_magnitude_dense_3_2/threshold_0/threshold instead.
INFO:tenso

<tensorflow.python.keras.callbacks.History at 0x2575528d128>

<b>Does the pruned model still beat the regular model?</b>

In [40]:
score = restored_pruned_mnist_model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.02319358599687448
Test accuracy: 0.9928


Pruning is now complete, although a few more steps remain to finish this optimization technique properly. After two more epochs of training, the accuracy decreased slightly and the loss increased slightly: the pruned model trained for 12 epochs did not perform as well as the pruned model trained for 10 epochs, <b><i>but both pruned models achieved a better accuracy and loss than the regular model.</i></b> The table below compares the losses and accuracies of the models.

<br/>
<table style="border: 1px solid #ddd;text-align: left;border-collapse: collapse;width: 80%;">
  <tr>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Model Type</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Epochs</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Loss</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Accuracy</th>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Regular MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0289</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9907</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Pruned MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0215</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9931</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Restored Pruned MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;"><i>10 + 2</i></td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0231</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9928</td>
  </tr>
</table>
<br/>
<h4>Preparing the Model for Serving by <i>Stripping the Pruning Wrappers from the Pruned Model</i></h4>
<br/> 
When it is time to export a model for serving, the <code>strip_pruning()</code> API must be called. It strips the pruning wrappers from the model, since they are only needed for training.
<br/>
<br/>
The model chosen for serving is the one with the best loss and metric(s). In this situation the best model is the <code>pruned_model</code>, with a loss of 0.0215 and an accuracy of 0.9931.

In [57]:
serving_model = sparsity.strip_pruning(pruned_model)

Displaying its architecture.

In [49]:
serving_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 28, 28, 32)        832       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 14, 14, 32)        128       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 14, 14, 64)        51264     
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 3136)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1024)             
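
After stripping the wrappers, the parameter counts match the unpruned model again, so sparsity has to be checked on the weight values themselves. A minimal way to do so is to measure the fraction of exact zeros in each weight tensor. The sketch below uses toy lists and a hypothetical helper `zero_fraction`; in the notebook the same function could presumably be applied to each flattened array returned by `serving_model.get_weights()`.

```python
def zero_fraction(values):
    """Fraction of entries that are exactly zero in a flat sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("cannot measure sparsity of an empty sequence")
    return sum(1 for v in values if v == 0) / len(values)


# Toy stand-ins for a pruned kernel and an unpruned bias (illustrative values).
toy_weights = {
    "toy_kernel": [0.0, 0.0, 0.31, 0.0, 0.0, -0.77, 0.0, 0.0, 0.0, 0.0],
    "toy_bias": [0.05, -0.02],
}
report = {name: zero_fraction(values) for name, values in toy_weights.items()}
for name, fraction in report.items():
    print(f"{name}: {fraction:.0%} zeros")
```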

Once again, a checkpoint file must be created, this time for the final model, the <code>serving_model</code>. <b>This is the model that will be deployed into production!</b>

In [58]:
_, checkpoint_serving_model_file = tempfile.mkstemp('serving_model.h5')
print('Saving pruned model to: ', checkpoint_serving_model_file)

Saving pruned model to:  C:\Users\kleye\AppData\Local\Temp\tmpjnu9v28zserving_model.h5


There is no need to save the optimizer with the graph for serving.

In [59]:
tf.keras.models.save_model(serving_model, checkpoint_serving_model_file, include_optimizer=False)

<h2>Comparing the Size of the Pruned Model and the Non-Pruned One After Compression</h2>
<br/>
Here the <code>tempfile.mkstemp()</code> function creates a temporary zip file, the model is written into it with DEFLATE compression, and the size of the model before and after compression is displayed for comparison.

In [52]:
_, zip_keras_file = tempfile.mkstemp('.zip') 
with zipfile.ZipFile(zip_keras_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(keras_file)

print("Size of the unpruned model before compression: %.2f Mb" % (os.path.getsize(keras_file) / float(2**20)))
print("Size of the unpruned model after compression: %.2f Mb" % (os.path.getsize(zip_keras_file) / float(2**20)))

Size of the unpruned model before compression: 12.52 Mb
Size of the unpruned model after compression: 11.59 Mb


In [60]:
_, zip_serving_model_file = tempfile.mkstemp('.zip') 
with zipfile.ZipFile(zip_serving_model_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(checkpoint_serving_model_file)

print("Size of the pruned model before compression: %.2f Mb" % (os.path.getsize(checkpoint_serving_model_file) / float(2**20)))
print("Size of the pruned model after compression: %.2f Mb" % (os.path.getsize(zip_serving_model_file) / float(2**20)))

Size of the pruned model before compression: 12.52 Mb
Size of the pruned model after compression: 2.50 Mb
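To see why the pruned file shrinks so much more under DEFLATE, the following standard-library sketch compresses two synthetic float32 buffers: one dense and one with about 90% of its values set to zero. The buffer length, the zero fraction, the random values and the helper names <code>make_weights</code> and <code>deflate_ratio</code> are assumptions for illustration only; they are not the notebook's actual weights.

```python
import random
import struct
import zlib


def make_weights(n, zero_fraction, seed=0):
    """Return n pseudo-random floats, with roughly zero_fraction of them set to 0.0."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < zero_fraction else rng.gauss(0.0, 0.05)
            for _ in range(n)]


def deflate_ratio(weights):
    """Pack the weights as float32 bytes and return original_size / compressed_size."""
    raw = struct.pack("%df" % len(weights), *weights)
    return len(raw) / len(zlib.compress(raw, 9))


dense_ratio = deflate_ratio(make_weights(100_000, zero_fraction=0.0))
sparse_ratio = deflate_ratio(make_weights(100_000, zero_fraction=0.9))
print("dense buffer compresses %.2fx" % dense_ratio)
print("90%% sparse buffer compresses %.2fx" % sparse_ratio)
```

Long runs of zero bytes are cheap for DEFLATE to encode, while the mantissa bytes of random non-zero floats are nearly incompressible, which is consistent with the gap between the two zip files above.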


<table style="border: 1px solid #ddd;text-align: left;border-collapse: collapse;width: 100%;">
  <tr>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Model Type</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Epochs</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Loss</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Accuracy</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Size Before .zip</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Size After .zip</th>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Regular MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0289</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9907</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">12.52 MB</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">11.59 MB</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Serving Model<br/>*Pruned MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0215</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9931</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">12.52 MB</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">2.51 MB</td>
  </tr>
</table>
<br/>
The benefits of magnitude-based weight pruning combined with compression of the pruned model are clear. So far, in this experiment, they are:
<ul>
    <li>Lower loss.</li>
    <li>Higher accuracy.</li>
    <li>Potentially faster inference (not measured in this notebook).</li>
    <li>Lighter models after compression.</li>
</ul>
In this case, compressing the unpruned model only reduced its size from <b>12.52 MB</b> to <b>11.59 MB</b>, a poor reduction of about <b>1.08</b> times. The pruned model started at the same <b>12.52 MB</b>, and compression reduced it to <b>2.51 MB</b>, a reduction of about <b>4.99</b> times! <b>Pruning and compression are an excellent duo to apply to machine learning models.</b>
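The reduction factors can be checked with a couple of lines of arithmetic. This is a minimal sketch that uses the sizes reported in the table above; the helper name <code>reduction_factor</code> is hypothetical.

```python
def reduction_factor(size_before_mb, size_after_mb):
    """How many times smaller the file is after compression."""
    return size_before_mb / size_after_mb


unpruned = reduction_factor(12.52, 11.59)  # regular MNIST model
pruned = reduction_factor(12.52, 2.51)     # pruned serving model
print("unpruned: %.2fx smaller" % unpruned)
print("pruned:   %.2fx smaller" % pruned)
```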
<br>
<h4>Pruning a Whole Model</h4>
<br/>
The <code>prune_low_magnitude</code> function can also be applied to an entire Keras model. The algorithm is then applied to all layers that are amenable to weight pruning (that the API knows about). Layers that the API knows are not amenable to weight pruning are ignored, and layers unknown to the API cause an error.
<br/>
<br/>
If the model has layers whose weights the API does not know how to prune, but which are perfectly fine to leave "un-pruned", the API can be applied on a per-layer basis.
<br/>
<br/>
Regarding pruning configuration, the same settings apply to all prunable layers in the model.
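As a rough illustration of what happens to the weights of each prunable layer, the sketch below zeroes the smallest-magnitude entries of a flat list of weights until a target sparsity is reached. It is a simplified stand-in written in plain Python: the function name <code>prune_by_magnitude</code> and the sample weights are hypothetical, and the toolkit itself works with masks that are updated during training rather than with a one-shot function like this.

```python
def prune_by_magnitude(weights, target_sparsity):
    """Zero the smallest-magnitude weights so that about target_sparsity of them are 0."""
    num_to_prune = int(round(len(weights) * target_sparsity))
    if num_to_prune == 0:
        return list(weights)
    # Indices ordered from the smallest to the largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


layer_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, 0.2]
pruned = prune_by_magnitude(layer_weights, target_sparsity=0.5)
print(pruned)
```

The large weights (0.8, 0.3, -0.6, -0.9, 0.2) survive, while the five entries closest to zero are removed.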
<br/>
<br/>
<div style="background-color:#e7f3fe;padding:15px;width:50%;height:50%;margin:auto;text-align:center;">
  <p style="">
      <strong>Note:</strong>
      Pruning doesn't preserve the optimizer associated with the original model. As a result, it is necessary to re-compile the 
      pruned model with a new optimizer.
    </p>
</div>

This example uses an already existing, serialized model: the regular MNIST model trained previously. Start by loading the serialized model into memory:

In [67]:
loaded_model = tf.keras.models.load_model(keras_file)



Time to prune the loaded model and compile the pruned model for training. A recap of the conditions for preparing the model for pruning:
<ul>
    <li>In this case training restarts from step 0.</li>
    <li>As the loaded model has already reached a satisfactory accuracy, pruning can start immediately.</li>
    <li>The parameter <code>begin_step</code> is therefore set to 0.</li>
    <li>The loaded model is trained for a few more epochs; the constant <code>EPOCHS_WKM</code> holds the number of epochs for training the whole Keras model.</li>
</ul>

In [83]:
EPOCHS_WKM = 4
END_STEP = np.ceil(1.0 * NUM_TRAIN_SAMPLES / BATCH_SIZE).astype(np.int32) * EPOCHS_WKM
print(END_STEP)

1876
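The printed value can be reproduced with plain arithmetic. In this sketch, 60000 training samples matches the training log further below, while a batch size of 128 is an assumption that happens to be consistent with the printed result.

```python
import math

NUM_TRAIN_SAMPLES = 60000  # matches "Train on 60000 samples" in the training log
BATCH_SIZE = 128           # assumed value; consistent with the printed END_STEP
EPOCHS_WKM = 4

# The last, partial batch of an epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / BATCH_SIZE)
end_step = steps_per_epoch * EPOCHS_WKM
print(steps_per_epoch, end_step)
```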


Then the dictionary of keyword arguments (**kwargs) <code>new_pruning_params</code> is created; it holds the key <code>pruning_schedule</code>, whose value is a <code>PolynomialDecay</code> schedule from the <code>sparsity</code> module, configured with its parameters.

In [84]:
new_pruning_params = {
      'pruning_schedule': sparsity.PolynomialDecay(initial_sparsity=0.50,
                                                   final_sparsity=0.90,
                                                   begin_step=0,
                                                   end_step=END_STEP,
                                                   frequency=100)
}
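To get a feel for how the target sparsity evolves under these parameters, here is a plain-Python approximation of a polynomial decay schedule. It is a sketch, not the toolkit's code: the function names are hypothetical, the exponent of 3 is assumed to match the toolkit default, and the rule that the mask is recomputed every <code>frequency</code> steps between <code>begin_step</code> and <code>end_step</code> is likewise an assumption about the schedule's behaviour.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3):
    """Target sparsity at a given training step for a polynomial decay schedule."""
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))  # clamp to [0, 1]
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


def update_steps(begin_step, end_step, frequency):
    """Steps at which the pruning mask would be recomputed."""
    return list(range(begin_step, end_step + 1, frequency))


# Print every sixth update step for the parameters used above.
for step in update_steps(0, 1876, 100)[::6]:
    s = polynomial_decay_sparsity(step, 0.50, 0.90, 0, 1876)
    print("step %4d -> target sparsity %.3f" % (step, s))
```

With a cubic exponent the sparsity rises quickly at first and flattens out as training approaches <code>end_step</code>, so most of the pruning happens while the network still has many steps left to recover.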

Create the model <code>new_pruned_model</code> with the <code>prune_low_magnitude()</code> function of the <code>sparsity</code> module; the model <code>loaded_model</code> and the dictionary of keyword arguments <code>new_pruning_params</code> are passed as arguments.

In [85]:
new_pruned_model = sparsity.prune_low_magnitude(loaded_model, **new_pruning_params)

Take a look at the <code>new_pruned_model</code> architecture.

In [86]:
new_pruned_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
prune_low_magnitude_conv2d ( (None, 28, 28, 32)        1634      
_________________________________________________________________
prune_low_magnitude_max_pool (None, 14, 14, 32)        1         
_________________________________________________________________
prune_low_magnitude_batch_no (None, 14, 14, 32)        129       
_________________________________________________________________
prune_low_magnitude_conv2d_1 (None, 14, 14, 64)        102466    
_________________________________________________________________
prune_low_magnitude_max_pool (None, 7, 7, 64)          1         
_________________________________________________________________
prune_low_magnitude_flatten  (None, 3136)              1         
_________________________________________________________________
prune_low_magnitude_dense (P (None, 1024)              6

<b>Time to compile </b><code>new_pruned_model</code><b>!</b>

In [87]:
new_pruned_model.compile(
    loss=tf.keras.losses.categorical_crossentropy,
    optimizer='adam',
    metrics=['accuracy'])

Set up the TensorBoard log directory again.

In [74]:
logdir = tempfile.mkdtemp()
print('Writing training logs to ' + logdir)

Writing training logs to C:\Users\kleye\AppData\Local\Temp\tmpoyzfxx2i


In [None]:
%tensorboard --logdir={logdir}

<h4>Train the Model for Another <i>Four</i> Epochs</h4>
<br/>
Add a pruning step callback to peg the pruning step to the optimizer's step, and another callback to write pruning summaries to <code>TensorBoard</code>.

In [88]:
callbacks = [
    sparsity.UpdatePruningStep(),
    sparsity.PruningSummaries(log_dir=logdir, profile_batch=0)
]

<b>Fit</b> the <code>new_pruned_model</code> and wait for the results of its loss function and its metric score.

In [89]:
new_pruned_model.fit(x_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS_WKM,
          verbose=1,
          callbacks=callbacks,
          validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/4
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_1_2/mask:0/sparsity is illegal; using prune_low_magnitude_conv2d_1_2/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_5/mask:0/sparsity is illegal; using prune_low_magnitude_dense_5/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_1_2/mask:0/sparsity is illegal; using prune_low_magnitude_dense_1_2/mask_0/sparsity instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_3/threshold:0/threshold is illegal; using prune_low_magnitude_conv2d_3/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_conv2d_1_2/threshold:0/threshold is illegal; using prune_low_magnitude_conv2d_1_2/threshold_0/threshold instead.
INFO:tensorflow:Summary name prune_low_magnitude_dense_5/threshold:0/threshold is illegal; using prune_low_magnitude_dense_5/threshold_0/threshold instead.
INFO:tensorflow:Summar

<tensorflow.python.keras.callbacks.History at 0x2575fbfde10>

<b>Did the Whole Pruned Model Beat the Custom Pruned Model and the Regular Model?</b>

In [90]:
score = new_pruned_model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.029654509916423388
Test accuracy: 0.9936


The pruned whole model reaches slightly higher accuracy than the other models, although its loss is the highest of the four:
<br/>
<table style="border: 1px solid #ddd;text-align: left;border-collapse: collapse;width: 80%;">
  <tr>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Model Type</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Epochs</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Loss</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Accuracy</th>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Pruned MNIST Whole Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;"><i>4</i></td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0296</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9936</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Pruned MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0215</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9931</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Restored Pruned MNIST Model</td>
      <td style="border: 1px solid #ddd;text-align: left;padding: 10px;"><i>2</i></td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0231</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9928</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Regular MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0289</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9907</td>
  </tr>
</table>
<br/>
<h4>Exporting the Pruned Whole Model for Serving</h4>

In [91]:
final_whole_pruned_model = sparsity.strip_pruning(new_pruned_model)

Finally, look at its architecture.

In [92]:
final_whole_pruned_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 32)        832       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 32)        0         
_________________________________________________________________
batch_normalization (BatchNo (None, 14, 14, 32)        128       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 64)        51264     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 3136)              0         
_________________________________________________________________
dense (Dense)                (None, 1024)              3

<h3>Preparing for Saving the Whole Pruned Model</h3>


In [102]:
_, new_pruned_keras_file = tempfile.mkstemp('final_whole_pruned_model.h5')
print('Saving pruned model to: ', new_pruned_keras_file)

Saving pruned model to:  C:\Users\kleye\AppData\Local\Temp\tmpzeb2lnfofinal_whole_pruned_model.h5


<b>Save the whole pruned model ready for being deployed into production!</b>

In [95]:
tf.keras.models.save_model(final_whole_pruned_model, new_pruned_keras_file, include_optimizer=False)

<h4>Compressing the Whole Pruned Model</h4>

In [96]:
_, zip_final_whole_pruned_model = tempfile.mkstemp('.zip')
with zipfile.ZipFile(zip_final_whole_pruned_model, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(new_pruned_keras_file)

print("Size of the pruned model before compression: %.2f Mb" % (os.path.getsize(new_pruned_keras_file) / float(2**20)))
print("Size of the pruned model after compression: %.2f Mb" % (os.path.getsize(zip_final_whole_pruned_model) / float(2**20)))

Size of the pruned model before compression: 12.52 Mb
Size of the pruned model after compression: 2.42 Mb


The <code>final_whole_pruned_model</code> has a size of <b>12.52 MB</b> before compression; after compression its size goes down to <b>2.42 MB</b>, about <b>5.17</b> times lighter. Its comparison with the other models:
<br/>
<table style="border: 1px solid #ddd;text-align: left;border-collapse: collapse;width: 80%;">
  <tr>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Model Type</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Epochs</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Total of Epochs</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Loss</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Accuracy</th>  
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Size Before .zip</th>
    <th style="border: 1px solid #ddd;text-align: left;padding: 10px;">Size After .zip</th>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Pruned MNIST Whole Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;"><i>4</i></td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;"><i>14</i></td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0296</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9936</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">12.52 MB</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">2.42 MB</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Pruned MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0215</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9931</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">12.52 MB</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">2.51 MB</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Restored Pruned MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;"><i>2</i></td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;"><i>12</i></td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0231</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9928</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Not Calculated</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Not Calculated</td>  
  </tr>
  <tr>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">Regular MNIST Model</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">10</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.0289</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">0.9907</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">12.52 MB</td>
    <td style="border: 1px solid #ddd;text-align: left;padding: 10px;">11.59 MB</td>
  </tr>
</table>
<br/>
The <code>final_whole_pruned_model</code> is the best model in terms of accuracy and compressed size, but its loss is the highest among the models.
<br/>
<h2>Converting to TensorFlow Lite</h2>
<br/>
The whole pruned model now has to be converted to a format that can run on the target backend. <code>TensorFlow Lite</code> is one such format, used for deployment to mobile devices. To convert to a <code>TensorFlow Lite</code> graph, use the <code>TFLiteConverter</code> as below:

In [118]:
tflite_model_file = 'C:/Users/kleye/AppData/Local/Temp/mpzeb2lnfofinal_whole_pruned_model.tflite'
converter = tf.lite.TFLiteConverter.from_keras_model(final_whole_pruned_model)
# Convert the Keras model to a TensorFlow Lite flat buffer and write it to disk.
tflite_model = converter.convert()
with open(tflite_model_file, 'wb') as f:
    f.write(tflite_model)

<h4>Compressing the TensorFlow Lite Model</h4>

In [119]:
_, zip_tflite = tempfile.mkstemp('.zip')
with zipfile.ZipFile(zip_tflite, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(tflite_model_file)
print("Size of the tflite model before compression: %.2f Mb" % (os.path.getsize(tflite_model_file) / float(2**20)))
print("Size of the tflite model after compression: %.2f Mb" % (os.path.getsize(zip_tflite) / float(2**20)))

Size of the tflite model before compression: 12.49 Mb
Size of the tflite model after compression: 2.31 Mb


<code>TensorFlow Lite</code> is a lighter version of <code>TensorFlow</code> aimed at mobile devices. The converted model takes <b>12.49 MB</b>, and after compression its size is <b>2.31 MB</b>, a reduction of about <b>5.4</b> times.
<br/>
<h3>Evaluating the Accuracy of the TensorFlow Lite Model</h3>
<br/>
The <code>Interpreter</code> class is an interpreter for a graph of nodes that read from and write to tensors: each node of the graph processes a set of input tensors and produces a set of output tensors. All input and output tensors are referenced by index.
<br/>
<br/>
The <code>allocate_tensors()</code> function updates the allocations for all tensors, resizing dependent tensors according to the given input tensor dimensions. This is relatively expensive, so it does not need to be called again while the input sizes stay the same; it must, however, be called before the first inference. The underlying call reports success or failure.
<br/>
<br/>
The <code>get_input_details()</code> function returns a list with the details of the model's inputs; in this case the input shape corresponds to <code>INPUT_SHAPE</code>.
<br/>
<br/>
The <code>get_output_details()</code> function returns a list with the details of the model's outputs.

In [120]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

The function <code>eval_model</code> evaluates the performance of the <code>TFLite</code> model and returns its accuracy. Inside it, three interpreter methods are worth explaining:
<ul>
    <li><code>set_tensor()</code>: Sets the value of the input tensor.</li>
    <li><code>invoke()</code>: Invokes the interpreter. Be sure to set the input sizes, allocate tensors and fill values before 
        calling this. Also, note that this function releases the GIL, so heavy computation can be done in the background while 
        the Python interpreter continues. No other function on this object should be called until the <code>invoke()</code> call has 
        finished.</li>
    <li><code>get_tensor()</code>: Gets the value of the output tensor (as a copy) and returns it as a numpy array. This function 
        cannot be used to read intermediate results. To avoid the copy, the function 
        <code>tensor()</code> should be used instead.</li>
</ul>

In [121]:
def eval_model(interpreter, x_test, y_test):
    total_seen = 0
    num_correct = 0
    
    for img, label in zip(x_test, y_test):
        inp = img.reshape((1, 28, 28, 1))
        total_seen += 1
        interpreter.set_tensor(input_index, inp)
        interpreter.invoke()
        predictions = interpreter.get_tensor(output_index)
        if np.argmax(predictions) == np.argmax(label):
            num_correct += 1
        
        if total_seen % 1000 == 0:
            print("Accuracy after %i images: %f" % (total_seen, float(num_correct) / float(total_seen)))
    
    return float(num_correct) / float(total_seen)
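The accuracy logic of <code>eval_model</code> can be tried without a TensorFlow Lite interpreter. The sketch below keeps only the argmax comparison against one-hot labels; the <code>predict</code> callable and the fixed scores are hypothetical stand-ins for the <code>set_tensor()</code> / <code>invoke()</code> / <code>get_tensor()</code> sequence.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predict, samples, one_hot_labels):
    """Fraction of samples whose predicted class matches the one-hot label."""
    num_correct = 0
    for sample, label in zip(samples, one_hot_labels):
        if argmax(predict(sample)) == argmax(label):
            num_correct += 1
    return num_correct / len(samples)


# Hypothetical stand-in for the interpreter: fixed class scores per sample.
fake_scores = {"a": [0.1, 0.7, 0.2], "b": [0.6, 0.3, 0.1], "c": [0.2, 0.2, 0.6]}
labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
result = accuracy(lambda s: fake_scores[s], ["a", "b", "c"], labels)
print(result)
```

Samples "a" and "c" are classified correctly and "b" is not, so two out of three predictions match their labels.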

Time to see the performance.

In [122]:
print(eval_model(interpreter, x_test, y_test))

Accuracy after 1000 images: 0.989000
Accuracy after 2000 images: 0.990500
Accuracy after 3000 images: 0.989667
Accuracy after 4000 images: 0.990000
Accuracy after 5000 images: 0.990400
Accuracy after 6000 images: 0.991167
Accuracy after 7000 images: 0.991714
Accuracy after 8000 images: 0.992750
Accuracy after 9000 images: 0.993556
Accuracy after 10000 images: 0.993600
0.9936


Its accuracy was not affected by the conversion: 0.9936, the same as the Keras model. Great!
<h2>Demonstrating How the Pruned Model Works With Other Optimization Techniques, Like Post-Training Quantization</h2>
<br/>
It is possible to combine pruning with other optimization techniques like Post Training Quantization.
<br/>
<br/>
<b>What is Post Training Quantization?</b>
<br/>
<br/>
Post-training quantization converts the weights to 8-bit precision as part of the conversion from a Keras model to TFLite's flat buffer, resulting in a 4x reduction in model size. 
<a href='https://github.com/kleyersoma/GuideForPostTrainingQuantization' target="_blank">Check this Guide for Post Training Quantization</a>.
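The 4x figure follows from storing each weight in one byte instead of the four bytes of a float32. The sketch below shows one simple way to map floats to 8-bit integers and back, using a single symmetric scale for the whole list. It is an illustration only: the function names and sample weights are hypothetical, and the scheme actually used by the TensorFlow Lite converter may differ in its details.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [int(round(w / scale)) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.42, -0.17, 0.0, 0.031, -0.5, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, "scale=%.5f" % scale, "max error=%.5f" % max_error)
```

Weights that pruning has set to exactly zero stay exactly zero after quantization, which is one reason the two techniques combine well.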
<br/>
<br/>
In the following example, the <code>final_whole_pruned_model</code> is converted with post-training quantization; then the size reduction is checked and its accuracy is validated.

In [123]:
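converter = tf.lite.TFLiteConverter.from_keras_model(final_whole_pruned_model)

# Enable post-training quantization of the weights.
# Note: later TensorFlow releases deprecate OPTIMIZE_FOR_SIZE in favour of tf.lite.Optimize.DEFAULT.
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]

# Convert once and reuse the result when writing the file.
tflite_quant_model = converter.convert()

# Note: this is the same path as tflite_model_file, so the non-quantized file is overwritten.
tflite_quant_model_file = 'C:/Users/kleye/AppData/Local/Temp/mpzeb2lnfofinal_whole_pruned_model.tflite'
with open(tflite_quant_model_file, 'wb') as f:
    f.write(tflite_quant_model)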

Let's see how its size was affected.

In [125]:
_, zip_tflite = tempfile.mkstemp('.zip')
with zipfile.ZipFile(zip_tflite, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(tflite_quant_model_file)
print("Size of the tflite model before compression: %.2f Mb" % (os.path.getsize(tflite_quant_model_file) / float(2**20)))
print("Size of the tflite model after compression: %.2f Mb" % (os.path.getsize(zip_tflite) / float(2**20)))

Size of the tflite model before compression: 3.13 Mb
Size of the tflite model after compression: 0.57 Mb


The quantized model is roughly 1/4 of the size of the original one. After compression it is about <b>21.96</b> times lighter than the original <code>TensorFlow</code> model (12.52 MB) and about <b>5.49</b> times lighter than the uncompressed quantized <code>TensorFlow Lite</code> file (3.13 MB).

<h3>Evaluating the Performance of the Whole Pruned, Compressed, Post-Training Quantized TensorFlow Lite Model</h3>
<br/>
As before, the <code>Interpreter</code> class is instantiated, its tensors are allocated with <code>allocate_tensors()</code>, the input and output details are obtained with <code>get_input_details()</code> and <code>get_output_details()</code>, and the <code>eval_model()</code> function runs the evaluation.

In [127]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_quant_model_file))
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

print(eval_model(interpreter, x_test, y_test))

Accuracy after 1000 images: 0.989000
Accuracy after 2000 images: 0.990500
Accuracy after 3000 images: 0.989667
Accuracy after 4000 images: 0.990000
Accuracy after 5000 images: 0.990400
Accuracy after 6000 images: 0.991167
Accuracy after 7000 images: 0.991714
Accuracy after 8000 images: 0.992750
Accuracy after 9000 images: 0.993556
Accuracy after 10000 images: 0.993700
0.9937


Lighter and marginally more accurate (0.9937 versus 0.9936): quantization did not hurt the model's accuracy.
<br/>
<br/>
The stellar duo of pruning and compression is joined by a third technique, post-training quantization; <i>together they definitely make a powerful trio!</i>

<br/>
<hr>
<br/>
<h2>Conclusion</h2>
<br/>
    This notebook teaches how to create sparse models with the TensorFlow Model Optimization Toolkit weight pruning API.
    This makes it possible to create models that take significantly less space on disk.
    <i><b>The resulting model can also be implemented more efficiently to avoid computation; in the future TensorFlow Lite will 
    provide such capabilities.</b></i>
    <br/>
    <br/>
    The notebook walked step by step through an end-to-end example of training a simple MNIST model with the 
    weight pruning API. It showed how to convert the model to the TensorFlow Lite format for mobile deployment, and 
    demonstrated how simple file compression reduced the model size about 5 times.
    <br/>
    <br/>
    <b>This new capability for Keras models can be particularly important for deployment in resource-constrained environments.
    </b>
<h2><i><a href='https://www.tensorflow.org/model_optimization/guide/pruning'>Tips from the official docs</a></i></h2>
<ul>
    <li>Start with a pre-trained model or weights if possible. If not, create one without pruning and start after.</li>
    <li>Do not prune very frequently, to give the model time to recover. <i>The toolkit provides a default frequency!</i></li>
    <li>Try running an experiment that prunes a pre-trained model to the final sparsity with begin step 0.</li>
    <li>Have a learning rate that's not too high or too low when the model is pruning. <b><i>Consider the pruning schedule to be a 
        hyperparameter</i></b>.</li>
</ul>
<h2>References</h2>
<ul>
    <li><a href='https://arxiv.org/pdf/1710.01878.pdf'>To prune, or not to prune: exploring the efficacy of pruning for model compression</a>, by Michael H. Zhu and Suyog Gupta.</li>
    <li><a href='https://www.tensorflow.org/model_optimization/guide/pruning'>Tensorflow Model Optimization Guide to Pruning with Keras</a>.</li>
</ul>