# Inception & its variants

Inception-v4, which evolved from GoogLeNet / Inception-v1, has a more uniform, simplified architecture and more inception modules than Inception-v3. Inception networks with residual connections, an idea introduced by Microsoft's ResNet, outperform similarly expensive Inception networks without residual connections. An ensemble of 1 Inception-v4 and 3 residual networks achieves a 3.08% top-5 error on the ILSVRC classification task.

### Inception-v1 / GoogLeNet : Inception Module

The Inception module was first introduced in Inception-v1 / GoogLeNet. The input passes through 1×1, 3×3 and 5×5 convs and a max pooling branch in parallel, and the branch outputs are concatenated to form the module's output. As a result, we don't need to decide which filter size should be used at each layer.
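The concatenation step can be sketched as simple shape bookkeeping. This is a minimal illustration, not the GoogLeNet implementation: the function name `concat_branches`, the 28×28 grid and the branch widths are hypothetical values chosen only for the example, and every branch is assumed to be padded so that it keeps the input's spatial size.

```python
def concat_branches(shapes):
    """Return the (height, width, channels) shape after channel-wise concatenation."""
    heights = {h for h, _, _ in shapes}
    widths = {w for _, w, _ in shapes}
    # Branches can only be concatenated if they share the same spatial size.
    if len(heights) != 1 or len(widths) != 1:
        raise ValueError("all branches must share the same spatial size")
    return (heights.pop(), widths.pop(), sum(c for _, _, c in shapes))


# Hypothetical branch outputs: 1x1 conv, 3x3 conv, 5x5 conv, max pooling.
branch_shapes = [(28, 28, 64), (28, 28, 128), (28, 28, 32), (28, 28, 32)]
module_output = concat_branches(branch_shapes)
print(module_output)  # (28, 28, 256)
```

The output width is just the sum of the branch widths, which is why the module can offer several filter sizes at once without the designer choosing one.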

### Inception-v2 / BN-Inception : Batch Normalization

Batch normalization (BN) was introduced in Inception-v2 / BN-Inception. ReLU is used as the activation function to address saturation and the vanishing gradients it causes, but it also makes the outputs less regular. It is advantageous for the distribution of a layer's input X to remain fixed over time, because a small change in that distribution is amplified as the network gets deeper. BN normalizes the layer inputs over each mini-batch, which keeps the distribution stable and allows a higher learning rate to be used.
Also, the 5×5 conv was replaced by two 3×3 convs, which reduces the number of parameters while maintaining the receptive field size.
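As a rough sketch of the normalization step, the snippet below applies the standard training-time batch normalization formula to a single feature across a mini-batch. It is an illustration under assumptions: the function name `batch_norm`, the default values for `gamma`, `beta` and `eps`, and the sample batch are hypothetical, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    # gamma and beta are learnable in a real network; here they are fixed.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one feature
normalized = batch_norm(batch)
print(normalized)
```

Whatever the scale of the incoming activations, the normalized feature has roughly zero mean and unit variance, which is what keeps the input distribution of the next layer stable.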

### Inception-v3 : Factorization

#### Factorizing Convolutions
The aim of factorizing convolutions is to reduce the number of connections/parameters without decreasing the network's efficiency.

Factorization Into Smaller Convolutions

Two 3×3 convolutions replace one 5×5 convolution as follows:

- By using 1 layer of 5×5 filter, number of parameters = 5×5 = 25
- By using 2 layers of 3×3 filters, number of parameters = 3×3 + 3×3 = 18
- Number of parameters is reduced by 28%
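The arithmetic above can be checked with a short sketch. The helper names `stacked_conv_params` and `receptive_field` are hypothetical; the counts are weights per input/output channel pair, ignoring biases, and the receptive field formula assumes stride-1 convolutions.

```python
def stacked_conv_params(kernel_sizes):
    """Weights per input/output channel pair for a stack of square kernels."""
    return sum(k * k for k in kernel_sizes)


def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


single = stacked_conv_params([5])      # one 5x5 layer
stacked = stacked_conv_params([3, 3])  # two 3x3 layers
reduction = 1 - stacked / single

print(single, stacked, f"{reduction:.0%}")              # 25 18 28%
print(receptive_field([5]), receptive_field([3, 3]))    # 5 5
```

Both options see the same 5×5 input window, so the saving in parameters does not shrink the receptive field.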

A similar technique was already used in VGGNet.
With this technique, one of the new Inception modules becomes:

![](./fig/IMA.png)

Factorization Into Asymmetric Convolutions

One 3×1 convolution followed by one 1×3 convolution replaces one 3×3 convolution as follows:

![](./fig/3x1.png)

- By using a 3×3 filter, number of parameters = 3×3 = 9
- By using 3×1 and 1×3 filters, number of parameters = 3×1 + 1×3 = 6
- Number of parameters is reduced by 33%

You may ask why we don't use two 2×2 filters to replace one 3×3 filter. If we use two 2×2 filters, number of parameters = 2×2×2 = 8, so the number of parameters is only reduced by 11%.
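The comparison between the asymmetric factorization and the two 2×2 filters can be reproduced with a small sketch. The helper names `kernel_params` and `reduction_vs_square` are hypothetical, and the counts are again per input/output channel pair without biases.

```python
def kernel_params(kernels):
    """Total weights for a sequence of (height, width) kernels."""
    return sum(h * w for h, w in kernels)


def reduction_vs_square(n, kernels):
    """Fraction of weights saved compared with a single n x n kernel."""
    return 1 - kernel_params(kernels) / (n * n)


asymmetric = reduction_vs_square(3, [(3, 1), (1, 3)])  # 1 - 6/9
two_2x2 = reduction_vs_square(3, [(2, 2), (2, 2)])     # 1 - 8/9

print(f"{asymmetric:.0%}", f"{two_2x2:.0%}")  # 33% 11%
```

The same function can be evaluated for an n×1 plus 1×n pair with larger n to see how the saving changes with kernel size.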
With asymmetric factorization, one of the new Inception modules (I call it Inception Module B here) becomes:


![](./fig/IMB.png)

Inception Module C is also proposed; according to the authors, it promotes high-dimensional representations:

![](./fig/IMC.png)

Thus, the authors suggest these 3 kinds of Inception modules. With factorization, the number of parameters is reduced for the whole network, so it is less likely to overfit and, consequently, the network can go deeper.

#### Auxiliary Classifier

Auxiliary classifiers were already suggested in GoogLeNet / Inception-v1. Inception-v3 makes some modifications.
Only 1 auxiliary classifier is used, on top of the last 17×17 layer, instead of 2 auxiliary classifiers. (The overall architecture is shown later.)

![](./fig/auxilary.png)


The purpose is also different. In GoogLeNet / Inception-v1, auxiliary classifiers were used to make training a deeper network possible. In Inception-v3, the auxiliary classifier is used as a regularizer. So the reasoning behind such modules in deep learning is still quite intuitive.
Batch normalization, suggested in Inception-v2, is also used in the auxiliary classifier.

#### Efficient Grid Size Reduction
Conventionally, as in AlexNet and VGGNet, feature maps are downsized by max pooling. The drawback is that max pooling followed by a conv layer is too greedy, because it discards information before the convolution sees it, while a conv layer followed by max pooling is too expensive, because the convolution runs on the full-size grid. Here, an efficient grid size reduction is proposed as follows:

![](./fig/down.png)

With the efficient grid size reduction, 320 feature maps are produced by a conv with stride 2, and 320 feature maps are obtained by max pooling. These 2 sets of feature maps are concatenated into 640 feature maps, which go to the next level of inception module.
This grid size reduction gives a network that is less expensive yet still efficient.
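The shape bookkeeping of this reduction can be sketched as follows. This is a simplified illustration, not the exact module: the 35×35 input grid, the 3×3 kernel, the absence of padding and the helper names are assumptions made for the example, and the conv branch is treated as a single strided convolution.

```python
def strided_output_size(size, kernel, stride):
    """Spatial size after an unpadded ('valid') convolution or pooling."""
    return (size - kernel) // stride + 1


def grid_reduction_shape(size, channels, conv_filters, kernel=3, stride=2):
    """Output shape when a strided conv branch and a pooling branch are concatenated."""
    out = strided_output_size(size, kernel, stride)
    # Pooling keeps the input channel count; the conv branch sets its own width.
    return (out, out, conv_filters + channels)


# Assumed 35x35 input with 320 feature maps, 320 filters in the conv branch.
reduced = grid_reduction_shape(35, 320, 320)
print(reduced)  # (17, 17, 640)
```

Both branches halve the grid at the same time, so neither a convolution on the full-size grid nor a pooling step ahead of the convolution is needed.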

### Inception-v3 Architecture

There are some typos in the description of the architecture in the text and table of the paper. I believe this is due to the intense ILSVRC competition in 2015. I therefore looked into the code to work out the architecture:

![](./fig/incepv3.png)

At 42 layers deep, the computation cost is only about 2.5 times higher than that of GoogLeNet [4], and the network is much more efficient than VGGNet.


#### Label Smoothing As Regularization

The purpose of label smoothing is to prevent the largest logit from becoming much larger than all the others:

`new_labels = (1 - ε) * one_hot_labels + ε / K`

where ε = 0.1 is a hyperparameter and K = 1000 is the number of classes. This has a kind of dropout effect on the classifier layer.
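The formula can be tried on a toy example. The sketch below uses a hypothetical 4-class problem instead of K = 1000 so that the numbers are easy to read, and the function name `smooth_labels` is made up for the illustration.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Apply new_labels = (1 - epsilon) * one_hot_labels + epsilon / K."""
    k = len(one_hot)  # number of classes
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]


one_hot = [0.0, 0.0, 1.0, 0.0]  # toy example with K = 4, true class at index 2
smoothed = smooth_labels(one_hot)
print(smoothed)
```

The true class keeps most of the probability mass, each other class receives ε / K, and the targets still sum to 1, so the network is no longer pushed to make one logit arbitrarily large.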

### Inception-v4
Inception-v4, with a more uniform, simplified architecture and more inception modules than Inception-v3, is shown below:

![](./fig/iv4.png)

This is a pure Inception variant without any residual connections. It can be trained without partitioning the replicas, thanks to memory optimization of backpropagation.
We can see that the techniques from Inception-v1 to Inception-v3 are all used. (Batch normalization is also used but not shown in the figure.)

### Inception-ResNet-v1

By using the above versions of Inception-A, Inception-B and Inception-C, we can build Inception-ResNet-v1. There is a shortcut connection at the left of each module. ResNet has already shown that this kind of shortcut connection helps networks go deeper.

![](./fig/inception-resnet.png)

It has roughly the computational cost of Inception-v3. Inception-ResNet-v1 trained much faster, but reached slightly worse final accuracy than Inception-v3.
However, the ReLU applied after the addition keeps the Inception network from going much deeper. In the paper, the authors also mention that when the number of filters exceeded 1000, the residual variants started to exhibit instabilities, and the network just “died” early during training.
To stabilize training, the residuals are scaled down before being added to the previous layer's activation. In general, scaling factors between 0.1 and 0.3 are picked to scale the residuals before they are added to the accumulated layer activations.
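The residual scaling can be sketched on plain lists of activations. This is only an illustration of the idea: the function names, the sample values and the default scale of 0.2 (picked from the 0.1 to 0.3 range mentioned above) are assumptions, and the residual branch itself is replaced by a precomputed list.

```python
def relu(xs):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in xs]


def scaled_residual_block(x, residual, scale=0.2):
    """Add a scaled-down residual to the shortcut input, then apply ReLU."""
    return relu([xi + scale * ri for xi, ri in zip(x, residual)])


x = [1.0, -2.0, 0.5]            # hypothetical shortcut activations
residual = [10.0, 5.0, -10.0]   # hypothetical output of the residual branch
out = scaled_residual_block(x, residual, scale=0.1)
print(out)  # [2.0, 0.0, 0.0]
```

With a small scale, a large residual only nudges the accumulated activations instead of overwhelming them, which is the stabilizing effect described above.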

### Inception-ResNet-v2

With the overall network schema taken from Inception-ResNet-v1, the stem taken from Inception-v4, and the above versions of Inception-A, Inception-B and Inception-C, we can build Inception-ResNet-v2. Again, there is a shortcut connection at the left of each module.

![](./fig/inception-resnetv2.png)

It has roughly the computational cost of Inception-v4. Inception-ResNet-v2 trained much faster and reached slightly better final accuracy than Inception-v4.