The ResNet convolutional neural network
======================

In this tutorial I will introduce the ResNet architecture designed by He et al. (2015) in the [article](https://arxiv.org/pdf/1512.03385.pdf) *"Deep Residual Learning for Image Recognition"*. I will implement the network in TensorFlow and train it on the CIFAR-10 dataset. ResNets have received a lot of attention in the last few years because they allow the designer to build very deep neural networks. It is important to remember that deep networks suffer from many problems, one of the most important being the [vanishing gradient](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). The vanishing gradient can be intuitively understood by thinking about a series of filters. When a signal passes through those filters, a certain amount of information is lost. In the same way, during backpropagation the error signal is multiplied by the local derivative of each layer, and when these factors are smaller than one the gradient shrinks with depth. Deep models have many layers, meaning that the signal gradually deteriorates before it reaches the first ones. ResNets address the problem by splitting a deep network into chunks. Each chunk receives the residual output of the previous chunk, plus the unfiltered input that was passed to the previous chunk.


<p align="center">
<img src="../etc/img/resnet_block.png" width="400">
</p>
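
The effect of the shortcut on the gradient can be illustrated with a toy calculation. The sketch below is only an illustration and not part of the tutorial code: it assumes scalar layers whose local derivative is a constant `local_grad` (a hypothetical value), and compares the gradient reaching the input of a plain chain with the gradient reaching the input of a chain of residual blocks.

```python
def plain_chain_gradient(local_grad, depth):
    """Gradient of y_n w.r.t. y_0 when y_i = f(y_{i-1}) and f' = local_grad."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad
    return grad


def residual_chain_gradient(local_grad, depth):
    """Gradient when y_i = f(y_{i-1}) + y_{i-1}: each block contributes (f' + 1)."""
    grad = 1.0
    for _ in range(depth):
        grad *= (local_grad + 1.0)
    return grad


# Assumption: every layer has the same small local derivative (0.01)
for depth in (5, 20, 50):
    print(depth, plain_chain_gradient(0.01, depth), residual_chain_gradient(0.01, depth))
```

Under these assumptions the gradient of the plain chain shrinks exponentially with depth, whereas the identity term keeps the gradient of the residual chain close to one.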


It is important to notice that the trick used in ResNets avoids the vanishing gradient rather than really solving it. Veit et al. (2016) pointed out in an [article](https://arxiv.org/pdf/1605.06431.pdf) that ResNets use a form of **ensemble by construction**, introducing short paths which can carry the gradient throughout the extent of very deep networks. This idea has something in common with the **inception module**, where the layers are stacked in parallel rather than in sequential order. The functioning of a ResNet can be better visualized if we unravel the view of a simple 3-block architecture.

<p align="center">
<img src="../etc/img/resnet_path.png" width="850">
</p>

Each layer of a ResNet has a residual module $f_{i}$ and a skip connection that bypasses the module. The residual module is a sequence of convolutions, batch normalization, and ReLU operations. The output $y_{i}$ of the $i$-th layer is the sum of the module output and of the layer input $y_{i-1}$:

$$ y_{i} = f_{i}(y_{i-1}) + y_{i-1}$$
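
To make the unraveled view concrete, the recurrence can be expanded for a toy network. The sketch below is an illustration under a strong assumption, not the authors' code: every residual module $f_{i}$ is a scalar multiplication by a hypothetical weight, so that the output of three stacked blocks can be compared exactly with the sum over the $2^{3} = 8$ paths of the unraveled view.

```python
from itertools import combinations


def forward(y0, weights):
    """Apply y_i = f_i(y_{i-1}) + y_{i-1} with f_i(y) = w_i * y."""
    y = y0
    for w in weights:
        y = w * y + y
    return y


def unraveled(y0, weights):
    """Sum the contribution of every path; a path is a subset of the modules."""
    total = 0.0
    n_paths = 0
    for k in range(len(weights) + 1):
        for subset in combinations(weights, k):
            contribution = y0
            for w in subset:
                contribution *= w
            total += contribution
            n_paths += 1
    return total, n_paths


weights = [0.5, 0.25, 2.0]  # hypothetical module weights
total, n_paths = unraveled(1.0, weights)
print(forward(1.0, weights), total, n_paths)
```

With linear modules the two computations give the same value, and the number of paths doubles each time a block is added.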




Implementing ResNet in TensorFlow
--------------------------------------

The implementation in TensorFlow is straightforward and requires just a few modifications to a standard CNN. It is possible to implement different types of ResNets, and the depth is the main factor to take into account. In the original [article](https://arxiv.org/pdf/1512.03385.pdf), He et al. (2015) propose five architectures, ranging from 18 to 152 layers.

<p align="center">
<img src="../etc/img/resnet_type.png" width="750">
</p>
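
The number in the name of each architecture counts the layers with weights. As a rough check, the sketch below recomputes the five depths from the number of residual blocks per stage reported in the architecture table of the paper. The dictionary and the function name are mine, chosen for illustration.

```python
# name: (convolutions per block, blocks in conv2_x, conv3_x, conv4_x, conv5_x)
ARCHITECTURES = {
    "ResNet-18": (2, [2, 2, 2, 2]),
    "ResNet-34": (2, [3, 4, 6, 3]),
    "ResNet-50": (3, [3, 4, 6, 3]),
    "ResNet-101": (3, [3, 4, 23, 3]),
    "ResNet-152": (3, [3, 8, 36, 3]),
}


def count_layers(convs_per_block, blocks_per_stage):
    # +1 for the initial convolution and +1 for the final fully connected layer
    return 1 + convs_per_block * sum(blocks_per_stage) + 1


for name, (convs_per_block, blocks_per_stage) in ARCHITECTURES.items():
    print(name, count_layers(convs_per_block, blocks_per_stage))
```

The deeper variants (50 layers and more) use blocks with three convolutions instead of two, which is why ResNet-34 and ResNet-50 share the same number of blocks per stage.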

The **convolutional layers** mostly have $3\times3$ filters. When the feature map size is halved, the number of filters is doubled. For instance, there are 64 filters in conv2 and 128 in conv3. The downsampling is performed directly by the convolutions, using a stride of 2.
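
Doubling the filters when the feature map is halved keeps the cost of each layer roughly constant. The following sketch checks this with a simple multiply-accumulate count for a $3\times3$ convolution. The helper names are hypothetical, the count ignores batch normalization, and a $32\times32$ map is assumed for conv2, as in the code of this tutorial.

```python
import math


def same_padding_output_size(size, stride):
    """Spatial size after a convolution with 'same' padding."""
    return math.ceil(size / stride)


def conv_macs(height, width, in_channels, out_channels, kernel=3):
    """Multiply-accumulate operations of one convolution producing a height x width map."""
    return height * width * kernel * kernel * in_channels * out_channels


size_conv2 = 32
size_conv3 = same_padding_output_size(size_conv2, stride=2)
macs_conv2 = conv_macs(size_conv2, size_conv2, 64, 64)
macs_conv3 = conv_macs(size_conv3, size_conv3, 128, 128)
print(size_conv3, macs_conv2, macs_conv3)
```

The map has four times fewer positions while the product of input and output channels is four times larger, so the two counts coincide.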

The **identity** shortcuts (solid lines in the picture below) can be used directly when the input and output have the same dimensions. One of the critical points is the passage between convolutional layers of different sizes. When the dimensions increase, the shortcut still performs an identity mapping, with extra zero entries padded for the increased dimensions (no extra parameters). When the shortcut goes across feature maps of two sizes, it is applied with a stride of 2. These shortcuts are drawn as dotted lines in the image below.

<p align="center">
<img src="../etc/img/resnet_shortcuts.png" width="500">
</p>
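
The zero-padding shortcut can be sketched without any framework. The example below is an illustration under my own assumptions rather than the implementation used in the paper: a feature map is stored as a nested list indexed as `[row][column][channel]`, the spatial downsampling keeps every second position, and the new channels are appended as zeros. The function name `zero_padded_shortcut` is hypothetical.

```python
def zero_padded_shortcut(feature_map, out_channels, stride=2):
    """Subsample the map spatially and pad the channel axis with zeros."""
    shortcut = []
    for row in feature_map[::stride]:
        new_row = []
        for pixel in row[::stride]:
            extra = out_channels - len(pixel)
            new_row.append(list(pixel) + [0.0] * extra)
        shortcut.append(new_row)
    return shortcut


# Toy 4x4 feature map with 2 channels: channel 0 stores the row, channel 1 the column
fmap = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
shortcut = zero_padded_shortcut(fmap, out_channels=4)
print(len(shortcut), len(shortcut[0]), len(shortcut[0][0]))
print(shortcut[1][1])
```

The shortcut has no trainable parameters: the original values are copied and the additional channels contribute nothing to the sum until the residual module fills them.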


In the original work the network was trained with SGD. Dropout was not applied. Batch normalization was used right after each convolution and before the activation.
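
For reference, my reading of the CIFAR-10 experiments in He et al. (2015) is that SGD was used with a momentum of 0.9, a weight decay of 0.0001 and mini-batches of 128 images, starting from a learning rate of 0.1 that is divided by 10 after 32k and 48k iterations. The sketch below only reproduces that step schedule. The function name and its parameters are hypothetical, and the values should be checked against the paper before relying on them.

```python
def learning_rate(step, base_lr=0.1, boundaries=(32000, 48000), factor=0.1):
    """Step schedule: multiply the learning rate by `factor` at each boundary."""
    lr = base_lr
    for boundary in boundaries:
        if step >= boundary:
            lr *= factor
    return lr


for step in (0, 31999, 32000, 48000, 63999):
    print(step, learning_rate(step))
```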

Here I will create a ResNet model function that can easily be used within the TensorFlow Estimator framework. For the sake of clarity I will declare every single layer of the network; however, it would be better to use a function that builds up the model block by block.

In [None]:
import tensorflow as tf #the listing uses the TensorFlow 1.x API (tf.layers, tf.estimator)

def my_model_fn(features, labels, mode):
    #Defining the CNN model
    #Batch norm uses the batch statistics only during training
    is_training = (mode == tf.estimator.ModeKeys.TRAIN)

    #CIFAR-10 images are 32x32 RGB (3 channels, channels-last layout is assumed)
    input_layer = tf.reshape(tf.cast(features, tf.float32), [-1, 32, 32, 3])

    #Input: 32x32x3, Output: 32x32x64
    #Assumption: a single 3x3 convolution with 64 filters is used as first layer,
    #so that its output can be added to the 64 feature maps of the next module
    with tf.variable_scope("conv_1"):
        conv1 = tf.layers.conv2d(inputs=input_layer, filters=64, kernel_size=[3, 3],
                                 padding="same", use_bias=False, activation=None)
        conv1 = tf.layers.batch_normalization(conv1, training=is_training)
        conv1 = tf.nn.relu(conv1)

    #Input: 32x32, Output: 16x16
    with tf.variable_scope("residual_module_2"):
        #First convolution of the unit
        conv2_1 = tf.layers.conv2d(inputs=conv1, filters=64, kernel_size=[3, 3],
                                   padding="same", use_bias=False, activation=None) #no ReLU activation here
        conv2_1 = tf.layers.batch_normalization(conv2_1, training=is_training) #apply batch norm
        conv2_1 = tf.nn.relu(conv2_1) #ReLU after the batch norm
        #Second convolution of the unit
        conv2_2 = tf.layers.conv2d(inputs=conv2_1, filters=64, kernel_size=[3, 3],
                                   padding="same", use_bias=False, activation=None)
        conv2_2 = tf.layers.batch_normalization(conv2_2, training=is_training)
        conv2_2 = tf.nn.relu(conv2_2)
        conv2_2_res = conv2_2 + conv1 #<----- Residual is applied (32x32)
        #Third convolution of the unit
        conv2_3 = tf.layers.conv2d(inputs=conv2_2_res, filters=64, kernel_size=[3, 3],
                                   padding="same", use_bias=False, activation=None)
        conv2_3 = tf.layers.batch_normalization(conv2_3, training=is_training)
        conv2_3 = tf.nn.relu(conv2_3)
        #Fourth convolution of the unit (stride of 2 is applied)
        conv2_4 = tf.layers.conv2d(inputs=conv2_3, filters=64, kernel_size=[3, 3], strides=2,
                                   padding="same", use_bias=False, activation=None)
        conv2_4 = tf.layers.batch_normalization(conv2_4, training=is_training)
        conv2_4 = tf.nn.relu(conv2_4)
        #The shortcut is subsampled with a stride of 2 to match the 16x16 maps
        shortcut2 = tf.layers.max_pooling2d(conv2_2_res, pool_size=1, strides=2)
        conv2_4_res = conv2_4 + shortcut2 #<----- Residual is applied (16x16)

    #Input: 16x16, Output: 8x8
    with tf.variable_scope("residual_module_3"):
        #First convolution of the unit (the number of filters is doubled)
        conv3_1 = tf.layers.conv2d(inputs=conv2_4_res, filters=128, kernel_size=[3, 3],
                                   padding="same", use_bias=False, activation=None) #no ReLU activation here
        conv3_1 = tf.layers.batch_normalization(conv3_1, training=is_training) #apply batch norm
        conv3_1 = tf.nn.relu(conv3_1) #ReLU after the batch norm
        #Second convolution of the unit
        conv3_2 = tf.layers.conv2d(inputs=conv3_1, filters=128, kernel_size=[3, 3],
                                   padding="same", use_bias=False, activation=None)
        conv3_2 = tf.layers.batch_normalization(conv3_2, training=is_training)
        conv3_2 = tf.nn.relu(conv3_2)
        #Identity shortcut with zero padding on the channel axis (64 -> 128)
        shortcut3 = tf.pad(conv2_4_res, [[0, 0], [0, 0], [0, 0], [32, 32]])
        conv3_2_res = conv3_2 + shortcut3 #<----- Residual is applied (16x16)
        #Third convolution of the unit
        conv3_3 = tf.layers.conv2d(inputs=conv3_2_res, filters=128, kernel_size=[3, 3],
                                   padding="same", use_bias=False, activation=None)
        conv3_3 = tf.layers.batch_normalization(conv3_3, training=is_training)
        conv3_3 = tf.nn.relu(conv3_3)
        #Fourth convolution of the unit (stride of 2 is applied)
        conv3_4 = tf.layers.conv2d(inputs=conv3_3, filters=128, kernel_size=[3, 3], strides=2,
                                   padding="same", use_bias=False, activation=None)
        conv3_4 = tf.layers.batch_normalization(conv3_4, training=is_training)
        conv3_4 = tf.nn.relu(conv3_4)
        #The shortcut is subsampled with a stride of 2 to match the 8x8 maps
        shortcut3_down = tf.layers.max_pooling2d(conv3_2_res, pool_size=1, strides=2)
        conv3_4_res = conv3_4 + shortcut3_down #<----- Residual is applied (8x8)

    #Assumption: global average pooling and a 10-way dense layer
    #(one unit per CIFAR-10 class) produce the logits
    with tf.variable_scope("output"):
        pooled = tf.reduce_mean(conv3_4_res, axis=[1, 2])
        logits = tf.layers.dense(inputs=pooled, units=10)

    #PREDICT mode
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {"classes": tf.argmax(input=logits, axis=1),
                       "probabilities": tf.nn.softmax(logits, name="softmax_tensor"),
                       "logits": logits}
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    #TRAIN mode
    elif mode == tf.estimator.ModeKeys.TRAIN:
        loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        #The moving averages of the batch norm are updated together with the weights
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        accuracy = tf.metrics.accuracy(labels=tf.argmax(labels, axis=1), predictions=tf.argmax(logits, axis=1))
        tf.summary.scalar('accuracy', accuracy[1]) #<-- accuracy[1] to grab the value
        tf.summary.image("input_features", input_layer, max_outputs=3)
        tf.summary.image("conv2_1_k1_feature_maps", conv2_1[:, :, :, 0:1], max_outputs=3) #conv2_1 kernel-1
        tf.summary.image("conv3_1_k1_feature_maps", conv3_1[:, :, :, 0:1], max_outputs=3) #conv3_1 kernel-1
        logging_hook = tf.train.LoggingTensorHook({"accuracy" : accuracy[1]}, every_n_iter=200)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op, training_hooks=[logging_hook])
    #EVAL mode
    elif mode == tf.estimator.ModeKeys.EVAL:
        loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
        accuracy = tf.metrics.accuracy(labels=tf.argmax(labels, axis=1), predictions=tf.argmax(logits, axis=1))
        eval_metric = {"accuracy": accuracy}
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric)
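
As noted before the listing, declaring every layer by hand does not scale to deeper networks. One possible way to organise a builder, sketched here under my own assumptions, is to generate a plan of the layers first and then loop over it, calling `tf.layers.conv2d` once per entry. The function `build_plan` and the dictionary keys are hypothetical names. The stride of 2 is placed on the last convolution of each module to mirror the listing, and the filters follow the doubling rule described earlier.

```python
def build_plan(filters_per_stage=(64, 128), blocks_per_stage=2, convs_per_block=2):
    """Return one dictionary per convolution, in the order the layers are stacked."""
    plan = []
    for stage_index, filters in enumerate(filters_per_stage, start=2):
        n_convs = blocks_per_stage * convs_per_block
        for conv_index in range(1, n_convs + 1):
            plan.append({
                "name": "conv%d_%d" % (stage_index, conv_index),
                "filters": filters,
                # downsample on the last convolution of the module
                "strides": 2 if conv_index == n_convs else 1,
                # a shortcut is added after the last convolution of each block
                "add_shortcut": conv_index % convs_per_block == 0,
            })
    return plan


for layer in build_plan():
    print(layer)
```

Changing `blocks_per_stage` or adding an entry to `filters_per_stage` is then enough to describe a deeper network without writing new layer declarations.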