In [3]:
from IPython.display import Image

# Network Optimizations

Optimizing neural networks for faster inference is an important technique in deep learning. The research community has proposed many such optimizations, including factoring [matrix-vector-product](https://www.microsoft.com/en-us/research/publication/restructuring-of-deep-neural-network-acoustic-models-with-singular-value-decomposition/) and [convolution](https://arxiv.org/abs/1511.06530) operations, [binarization/quantization](https://arxiv.org/abs/1609.07061), [sparsification](https://arxiv.org/abs/1611.06473) and the use of [frequency-domain representations](https://dl.acm.org/citation.cfm?id=3062228).

The goal of the `cntk.contrib.netopt` module is to provide users of CNTK with easy-to-use interfaces to speed up or compress their networks using such optimizations, and to give writers of optimizations a framework within which to export them to CNTK users. The initial release of `netopt` supports factoring of `Dense` CNTK layers and the 1-bit binarization of `Convolutional` layers.

The purpose of this manual is to help you understand some of the facilities CNTK provides to make the development of deep learning models easier. Some of the advice here is considered good programming practice in general, but we will still cover it in the context of building models.

## Description
`netopt` breaks down the process of optimizing models into three steps:

1. Replace operations (or blocks) in the network with "cheaper", possibly user-defined operations that are trainable, i.e., have forward and back-prop functions defined. We call the resulting network the *transformed* network.

2. [Optionally] Retrain the transformed network. Almost always, the transformation results in a significant drop in accuracy, and re-training the transformed version fairly quickly regains some of the lost accuracy.

3. [Optionally] Replace operations in the retrained network with operations that have fast implementations on particular processors. These new operations may not be trainable (i.e., have no back-prop functions defined), and can only be run in forward mode. We call the resulting network a *lowered* network. Lowering is sometimes necessary to harvest the benefits of optimization. For example, a binarized convolution layer is only useful if the new layer is implemented using a lowered version of the convolution algorithm optimized for 1-bit values.

Each optimization provided via `netopt` should take as arguments a model to be optimized, an optional training function for that model, and other optional arguments specific to that optimization, and should return a model (possibly in a lowered representation) on which the above sequence of transformation, training and lowering has been performed.
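
The three steps above can be sketched as a small driver function. This is only an illustration of the control flow, not the `netopt` implementation: the `optimize` function, its parameter names, and the toy "model" (a list of operation names) are all hypothetical.

```python
def optimize(model, transform, train=None, lower=None):
    """Hypothetical driver for the three steps: transform, retrain, lower."""
    transformed = transform(model)   # step 1: swap in cheaper, trainable ops
    if train is not None:
        train(transformed)           # step 2 (optional): retrain the transformed network
    if lower is not None:
        return lower(transformed)    # step 3 (optional): swap in forward-only fast ops
    return transformed

# Toy stand-in for a network: just a list of operation names (an assumption).
model = ["convolution", "dense"]
log = []

result = optimize(
    model,
    transform=lambda m: ["binary_convolution" if op == "convolution" else op for op in m],
    train=lambda m: log.append("trained"),
    lower=lambda m: [op + "_native" if op.startswith("binary") else op for op in m],
)
print(result, log)
```

Passing `train=None` or `lower=None` skips the corresponding optional step, which mirrors the factorization case described later, where neither retraining nor lowering is required.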

## Factorization of dense layers
This optimization, accessed via the `factorization.factor_dense(.)` function, replaces `Dense` layers with factored variants. In particular, a dense layer encapsulates the operation *r = W\*x + b*. If *W* is of shape *m\*n*, this operation will take *O(mn)* operations. Now consider an (approximate) *factorization* of *W*: *W ~ UV*, where *U* has shape *m\*k* and *V* has shape *k\*n*. If we replace *r* with *r' = (U \* (V\*x)) + b*, then computing *r'* will take *O(mk + kn)* operations. If *k* is much smaller than *m* and *n*, this can yield large savings. `netopt` allows users to optionally specify a "factorization function" to use to derive *U* and *V* from *W*. Alternatively, it allows users to specify a "rank function" that can examine *W* and derive a rank *k* that should be used for factorization. In this case, which we expect to be the common use case, `netopt` will perform a rank-*k* [singular-value decomposition](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.svd.html) to produce *U* and *V*.
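
To make the cost argument concrete, here is a minimal NumPy sketch of a rank-*k* factorization via truncated SVD. It follows the *r = W\*x + b* convention used above and is not the code `netopt` runs internally; the helper name `factor_svd` and the matrix sizes are assumptions chosen for illustration.

```python
import numpy as np

def factor_svd(W, k):
    """Return U (m x k) and V (k x n) whose product is the rank-k truncated SVD of W."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    U = u[:, :k] * s[:k]   # fold the top-k singular values into the left factor
    V = vt[:k, :]
    return U, V

rng = np.random.default_rng(0)
m, n, k = 64, 48, 8

# Build a matrix that is exactly rank k, so the factorization is numerically exact.
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
x = rng.standard_normal(n)
b = rng.standard_normal(m)

U, V = factor_svd(W, k)
r = W @ x + b                  # original dense layer
r_factored = U @ (V @ x) + b   # factored layer

dense_cost = m * n             # multiply-adds for W @ x
factored_cost = m * k + k * n  # multiply-adds for U @ (V @ x)
print(np.allclose(r, r_factored), dense_cost, factored_cost)
```

For a trained weight matrix that is not exactly low rank, `r_factored` only approximates `r`, which is why the transformed network may need retraining.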

`netopt.examples.dense_factorization_example()` shows how to use dense factorization to transform a graph. In this example, no re-training of the transformed graph is performed. Further, factorization does not require lowering of the transformed graph, since the operations used by the transformed graph (matrix-vector multiplies over floats) are already well supported by the default CNTK implementation.

### Using Factorization
We will use the [CNTK 102: Feed Forward Network with Simulated Data](https://cntk.ai/pythondocs/CNTK_102_FeedForward.html) tutorial to demonstrate the use of factorization.
We assume you have a working version of the code for the above tutorial and will show how to introduce factorization.

The feed forward network created for the above tutorial is relatively simple: it has 2 hidden layers (`num_hidden_layers`), each with 50 hidden nodes (`hidden_layers_dim`).


In [4]:
# Figure 1
Image(url="http://cntk.ai/jup/feedforward_network.jpg", width=200, height=200)

Once the training is completed, we have a model, denoted as `z`, that can be optimized with factorization.
The two hidden layers in this network are fully connected and hence good candidates for the factorization optimization.
To optimize the network, we can follow the steps below.

In [None]:
# import netopt module
import cntk.contrib.netopt.factorization as nc

# function determining the rank of the factorization.
def get_rank(W):    
    return int(len(W) * 0.6) # new rank will be 60% of the original

# a function to determine which layers to optimize. Here we pick only dense
# layers whose weight matrix is square (same number of rows and columns).
def filter(model):
    W = model.W.value
    return len(W) == len(W[0])

# newz will have its second dense layer replaced with an optimized dense layer.
newz = nc.factor_dense(z, projection_function=get_rank, filter_function=filter)

### projection_function
This function determines the rank of the weight matrix (`W`) that the optimized network will have.
For this demonstration, we use a simple rank function that returns 60% of the number of rows of the original weight matrix.
The function can perform more complicated calculations to determine the rank.
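
As one example of a more involved rank function, the sketch below picks the smallest rank that retains a given fraction of the squared singular-value "energy" of `W`. This is a common heuristic rather than something `netopt` provides; the function name `get_rank_by_energy` and the 90% default are assumptions.

```python
import numpy as np

def get_rank_by_energy(W, energy=0.9):
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(np.asarray(W, dtype=float), compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index where the cumulative share reaches the threshold, as a 1-based rank.
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))  # guard against round-off pushing k past the full rank

# Toy matrix with singular values 3, 2, 1, 0 (squared shares 9/14, 4/14, 1/14, 0).
W_demo = np.diag([3.0, 2.0, 1.0, 0.0])
k_90 = get_rank_by_energy(W_demo, energy=0.9)
k_50 = get_rank_by_energy(W_demo, energy=0.5)
print(k_90, k_50)
```

Assuming the rank function is called with the weight matrix as an array, as `get_rank` above is, such a function could be passed as `projection_function` in place of `get_rank`.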

### filter_function
This function provides an option to apply the factorization only to selected dense layers.
It can be used to select a layer based on its name or any other criterion. In the above example, we select only layers whose weight matrix is square, which in this network matches the second hidden layer.
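
A name-based filter could look like the sketch below. It assumes the object handed to the filter exposes a `name` attribute; the layer names are hypothetical, and `SimpleNamespace` objects stand in for real layers so that the snippet runs without CNTK.

```python
from types import SimpleNamespace

def make_name_filter(*names):
    """Build a filter that accepts only layers whose name is in `names`."""
    wanted = set(names)

    def filter_by_name(layer):
        return layer.name in wanted

    return filter_by_name

# Stand-in layers with hypothetical names (not taken from the tutorial).
layers = [
    SimpleNamespace(name="hidden_1"),
    SimpleNamespace(name="hidden_2"),
    SimpleNamespace(name="output"),
]

select = make_name_filter("hidden_2")
selected = [layer.name for layer in layers if select(layer)]
print(selected)
```

For this to work on a real model, the layers must have been given names when the network was created.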

### Evaluating factorized network
Once the optimization is completed, we can evaluate the new model using the same evaluation steps as for the original network, as shown below.

In [None]:
out = C.softmax(newz)

predicted_label_probs = out.eval({input : features}) #evaluate the new model
print("Label    :", [np.argmax(label) for label in labels]) #original labels
print("Predicted:", [np.argmax(row) for row in predicted_label_probs])

## Binarization of convolution operations

This optimization, accessed via the `quantization.binarize_convolution(.)` function, replaces `convolution` ops in the network, which typically operate on float values, with a lowered, user-defined `binary_convolution` op, which operates on bit values as per the [Quantized Neural Network approach of Courbariaux et al.](https://arxiv.org/abs/1609.07061).
Transforming convolution layers that process the inputs to the entire network (quantizing the "entry" convolution operation) usually results in unacceptable performance degradation.
Therefore, the `quantization.binarize_convolution(.)` API requires a `filter_function` to select the convolution layers that need to be transformed.
The current implementation quantizes both weights and inputs to every transformed layer to 1 bit. 
In the current release only CPUs with AVX support will see significant speedup of binarized networks.
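
The reason 1-bit values can be fast is that a dot product of two {-1, +1} vectors reduces to an XNOR followed by a bit count. The pure-Python sketch below illustrates that general idea only; it is not the `netopt` kernel, and the helper names `binarize` and `binary_dot` are assumptions.

```python
def binarize(values):
    """Pack signs into an int: bit i is 1 if values[i] >= 0 (+1), else 0 (-1)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n stored as bit masks."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * agree - n  # agreements contribute +1, disagreements -1

def signs(values):
    return [1 if v >= 0 else -1 for v in values]

weights = [0.7, -1.2, 0.1, -0.4]
inputs = [0.3, 0.9, -2.0, -0.5]

# Reference: ordinary dot product over the sign vectors.
reference = sum(w * x for w, x in zip(signs(weights), signs(inputs)))
fast = binary_dot(binarize(weights), binarize(inputs), len(weights))
print(reference, fast)
```

A vectorized CPU implementation applies the same trick to many bits per instruction, which is consistent with the note above that wide-vector support matters for the speedup.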

The training of the network can happen after binarization but before the transformation of operations into optimized versions.
The API requires a `training_function`, which will be invoked after binarization.

Binarization makes extensive use of CNTK's user-level extensibility capability. Please see the [Extensibility section](../../../../../Examples/Extensibility/BinaryConvolution) for more details.

### Using Binarization

We will use the [CNTK 103: Part D - Convolutional Neural Network with MNIST](https://cntk.ai/pythondocs/CNTK_103D_MNIST_ConvolutionalNeuralNetwork.html) tutorial to demonstrate the use of binarization.
We assume you have a working version of the code for the above tutorial and will show how to introduce binarization.

The convolution network created in the above example has two convolution layers followed by a dense layer, as shown below. 

In [8]:
# Figure 2
Image(url="https://www.cntk.ai/jup/cntk103d_convonly2.png", width=400, height=600)

Once the training is completed, we have a model, denoted as `z`, that can be optimized with quantization. 
The first convolution layer is connected to the input and hence we will not use it for optimization.
The next layer is a good candidate to perform binarization and optimization. For this, please follow the steps below.

In [None]:
import cntk.contrib.netopt.quantization as cq

# define a new training function. Note: this function accepts a network as an input parameter.
def do_train_and_test(model):
    reader_train = create_reader(train_file, True, input_dim, num_output_classes)
    reader_test = create_reader(test_file, False, input_dim, num_output_classes)
    train_test(reader_train, reader_test, model)

# create convolution network.
z = create_model(x)

# define a filter to determine which layer or layers need to be optimized.
def filter(x):
    return x.name == 'second_conv'

# optimize the network with binarization and the native implementation.
optimized_z = cq.binarize_convolution(z, do_train_and_test, filter)

`binarize_convolution` performs the binarization, calls the `do_train_and_test` function, and finally replaces the convolution
layer with a native Halide implementation. The resulting network, `optimized_z`, can be used as a regular network for evaluation.
Please refer to [Run evaluation / prediction section](https://cntk.ai/pythondocs/CNTK_103D_MNIST_ConvolutionalNeuralNetwork.html#) of the tutorial for the next steps.

#### Separate training step
The API described above requires a training function to operate. `cntk.contrib.netopt.quantization` also provides two
APIs that give the user more control over the convolution optimization.

In [None]:
# perform the binarization step only. This can be done right after network creation
# and requires no training of the original network. filter selects the convolution
# layer or layers to which the binarization is applied.
binz = cq.convert_to_binary_convolution(z, filter)

# perform training on binz network
# e.g. def do_train(binz)

# Convert the binarized model into Halide implementations
native_binz = cq.convert_to_native_binary_convolution(binz)