![](https://www.domsoria.com/wp-content/uploads/2019/11/keras.png)

# Inception Network

The Inception network was an important milestone in the development of CNN classifiers. Prior to its inception (pun intended), most popular CNNs just stacked convolution layers deeper and deeper, hoping to get better performance.

The Inception network, on the other hand, was complex and heavily engineered. It used a lot of tricks to push performance, both in terms of speed and accuracy. Its constant evolution led to the creation of several versions of the network. The popular versions are as follows:
* Inception v1
* Inception v2
* Inception v3
* Inception v4
* Inception ResNet

## Which problems led to Inception networks?

Before digging into Inception networks (yes, the name comes from the movie), it is useful to look at why such a network was needed.

#### Fun Fact
The meme is referenced in the [original paper](https://arxiv.org/pdf/1409.4842v1.pdf).
![](https://miro.medium.com/max/1024/1*cwR_ezx0jliDvVUV6yno5g.jpeg)

### Problems
1. Salient parts of an image can vary greatly in size. For instance, an image of a dog can be any of the following; the area occupied by the dog is different in each image.
![](https://miro.medium.com/max/1400/1*aBdPBGAeta-_AM4aEyqeTQ.jpeg)
2. Because of this huge variation in the location of the information, choosing the right kernel size for the convolution operation becomes tough. A larger kernel is preferred for information that is distributed more globally, and a smaller kernel is preferred for information that is distributed more locally.
3. Very deep networks are prone to overfitting. It is also hard to pass gradient updates through the entire network.
4. Naively stacking large convolution operations is computationally expensive.

### Solution!

__Inception module__

Why not have filters with multiple sizes operate on the same level? The network essentially would get a bit “wider” rather than “deeper”. The authors designed the inception module to reflect the same.

![](https://i.ytimg.com/vi/KfV8CJh7hE0/maxresdefault.jpg)

## Inception v1

![](images/Inception_module1.jpg)

Consider, as an example, just the $5\times5$ convolution:

![](images/Inception_module2.jpg)


As stated before, deep neural networks are **computationally expensive**. To make them cheaper, the authors limit the number of input channels by adding an extra $1\times1$ convolution before the $3\times3$ and $5\times5$ convolutions. Though adding an extra operation may seem counterintuitive, $1\times1$ convolutions are far cheaper than $5\times5$ convolutions, and the reduced number of input channels also helps. Note, however, that in the pooling branch the $1\times1$ convolution is placed after the max pooling layer, rather than before it.
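To make the saving concrete, here is a rough multiplication count for the $5\times5$ branch with and without the $1\times1$ reduction. The $28\times28\times192$ input size is an assumption chosen for illustration, the filter counts (16 reduction filters, 32 $5\times5$ filters) are example values, and `conv_mults` is a hypothetical helper, not a Keras function.

```python
def conv_mults(height, width, in_channels, out_channels, kernel):
    """Multiplications in a stride-1, 'same'-padded convolution (biases ignored)."""
    return height * width * out_channels * kernel * kernel * in_channels

# Assumed input feature map: 28 x 28 spatial positions, 192 channels
h, w, c_in = 28, 28, 192

# 5x5 convolution applied directly to all 192 input channels
direct = conv_mults(h, w, c_in, 32, 5)

# 1x1 reduction to 16 channels, followed by the 5x5 convolution
reduced = conv_mults(h, w, c_in, 16, 1) + conv_mults(h, w, 16, 32, 5)

print(direct)                      # 120422400
print(reduced)                     # 12443648
print(round(direct / reduced, 1))  # 9.7
```

Under these assumptions the reduced branch needs roughly a tenth of the multiplications, even though it contains one more layer.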

![](images/Inception_module3.jpg)

### GoogLeNet

Using dimensionality reduction, a very popular architecture was built: _GoogLeNet_.

<img src="https://miro.medium.com/max/1400/1*uW81y16b-ptBDV8SIT1beQ.png" style="width:750px;height:300px;">
<center><small> GoogLeNet. The orange box is the stem, which has some preliminary convolutions. The purple boxes are auxiliary classifiers. The wide parts are the inception modules. (Source: Inception v1)</small></center>

GoogLeNet has 9 inception modules stacked linearly. It is 22 layers deep (27, including the pooling layers). It uses global average pooling at the end of the last inception module.

Of course, it is a pretty deep classifier. As with any very deep network, it is subject to the vanishing gradient problem.
To prevent the middle part of the network from “dying out”, the authors introduced two auxiliary classifiers (the purple boxes in the image). They essentially applied softmax to the outputs of two of the inception modules and computed an auxiliary loss over the same labels. The total loss function is a weighted sum of the auxiliary losses and the real loss. The weight used in the paper was 0.3 for each auxiliary loss.
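The weighted sum can be written out as a small sketch. The loss values below are made up for illustration, and `total_loss` is a hypothetical helper; only the 0.3 weight comes from the text above.

```python
def total_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus each auxiliary loss scaled by aux_weight."""
    return main_loss + aux_weight * sum(aux_losses)

# Hypothetical per-batch losses: main classifier and the two auxiliary classifiers
loss = total_loss(main_loss=1.2, aux_losses=[1.5, 1.4])
print(round(loss, 2))  # 2.07
```

The auxiliary terms only influence training; at inference time the auxiliary classifiers are discarded and only the main output is used.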

## Inception v2 and v3

These two further versions of the Inception module were presented in the same [paper](https://arxiv.org/pdf/1512.00567v3.pdf).

### The Premise:
Reduce the representational bottleneck. The intuition is that neural networks perform better when convolutions do not alter the dimensions of the input drastically. Reducing the dimensions too much may cause loss of information, known as a “representational bottleneck”.

Using smart factorization methods, convolutions can be made more efficient in terms of computational complexity.

### Solution
![](https://miro.medium.com/max/1114/1*RzvmmEQH_87qKWYBFIG_DA.png)

* The authors noticed that a $5\times5$ convolution is $25/9 \approx 2.78$ times more expensive than a $3\times3$ convolution with the same number of filters; hence they replaced each $5\times5$ convolution with two stacked $3\times3$ convolutions.

* Moreover, they factorize convolutions of filter size $n\times n$ into a combination of $1\times n$ and $n\times 1$ convolutions. For example, performing a $1\times 3$ convolution and then a $3\times 1$ convolution on its output covers the same receptive field as a single $3\times 3$ convolution. They found this method to be $33\%$ cheaper than the single $3\times 3$ convolution. This is illustrated in the image below.
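Both savings follow from counting kernel weights per input/output channel pair. The sketch below assumes every layer keeps the same number of input and output channels, so only the kernel shapes matter; `kernel_weights` is a hypothetical helper.

```python
def kernel_weights(*kernel_shapes):
    """Total weights per channel pair for a stack of kernels given as (height, width)."""
    return sum(h * w for h, w in kernel_shapes)

full_5x5 = kernel_weights((5, 5))          # 25
two_3x3 = kernel_weights((3, 3), (3, 3))   # 9 + 9 = 18
full_3x3 = kernel_weights((3, 3))          # 9
asymmetric = kernel_weights((1, 3), (3, 1))  # 3 + 3 = 6

saving_5x5 = 1 - two_3x3 / full_5x5      # two 3x3 instead of one 5x5
saving_3x3 = 1 - asymmetric / full_3x3   # 1x3 + 3x1 instead of one 3x3

print(f"{saving_5x5:.0%}")  # 28%
print(f"{saving_3x3:.0%}")  # 33%
```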

![](https://miro.medium.com/max/1196/1*hTwo-hy9BUZ1bYkzisL1KA.png)

Stacked together, these ideas lead to the following scheme:

![](https://miro.medium.com/max/1150/1*DVXTxBwe_KUvpEs3ZXXFbg.png)


## Another important point: Average pooling

![](https://miro.medium.com/max/1400/1*0-wMHcASLDFzx9YBRCZXHg.png)

Previously, we have seen that fully connected (FC) layers are used at the end of a network, such as in AlexNet. All inputs are connected to each output.

* FC: __Number of weights (connections) = 7×7×1024×1024 ≈ 51.4M__

* GoogLeNet: global average pooling is used nearly at the end of the network by averaging each feature map from 7×7 to 1×1, as in the figure above.
 __Number of weights = 0__

The authors found that moving from FC layers to average pooling improved the top-1 accuracy by about 0.6%.

With no weights to fit, the pooling layer also has a regularising effect, and GoogLeNet is less prone to overfitting.
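As a quick check of the numbers above, the following sketch counts the FC weights and applies global average pooling to a tiny made-up feature map. `global_average_pool` is a hypothetical pure-Python helper written for illustration, not the Keras layer.

```python
# Weights of an FC layer connecting a flattened 7x7x1024 map to 1024 units (biases ignored)
fc_weights = 7 * 7 * 1024 * 1024
print(fc_weights)  # 51380224

def global_average_pool(feature_maps):
    """feature_maps: one 2D list per channel -> one mean value per channel."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

# Two made-up 2x2 channels; each collapses to a single number, with no weights involved
tiny = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]
pooled = global_average_pool(tiny)
print(pooled)  # [2.5, 2.0]
```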

## Exercise: Implementation of GoogLeNet in Keras

We will use the CIFAR-10 dataset for this purpose.

CIFAR-10 (Canadian Institute For Advanced Research) is a popular image classification dataset, commonly used to train machine learning and computer vision algorithms, and one of the most widely used datasets in machine learning research. It contains $60,000$ $32\times32$ color images in $10$ different classes, divided into 50,000 training images and 10,000 test images.

The 10 different classes can be represented by a list:
```
classes = ['airplanes', 'cars', 'birds', 'cats', 'deer', 'dogs', 'frogs', 'horses', 'ships', 'trucks']
``` 
There are $6,000$ images of each class.

Computer algorithms for recognizing objects in photos often learn by example. CIFAR-10 is a set of images that can be used to teach a computer how to recognize objects. Since the images in CIFAR-10 are low-resolution ($32\times32$), this dataset can allow researchers to quickly try different algorithms to see what works. Various kinds of convolutional neural networks tend to be the best at recognizing the images in CIFAR-10.

CIFAR-10 is a labeled subset of the $80$ million tiny images dataset. When the dataset was created, students were paid to label all of the images.

In [1]:
import math

import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dropout, Dense, Input, Lambda, concatenate, GlobalAveragePooling2D, AveragePooling2D, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical

First of all, let's define a function that loads the dataset in the proper form.

In [2]:
num_classes = 10

def load_cifar10_data():

    # Load cifar10 training and validation sets
    (X_train, Y_train), (X_valid, Y_valid) = cifar10.load_data()

    # Transform targets to keras compatible format
    Y_train = to_categorical(Y_train, num_classes)
    Y_valid = to_categorical(Y_valid, num_classes)
    
    X_train = X_train.astype('float32')
    X_valid = X_valid.astype('float32')
    
    # Rescale pixel values of both sets from [0, 255] to [0, 1]
    X_train /= 255.
    X_valid /= 255.

    return X_train, Y_train, X_valid, Y_valid

In [None]:
# load_cifar10_data takes no arguments; resizing to 224x224 happens inside the model
X_train, y_train, X_test, y_test = load_cifar10_data()

Now, we will define our deep learning architecture. We will quickly define a function to do this, which, when given the necessary information, gives us back the entire inception layer.

In [None]:
def inception_module(x,
                     filters_1x1,
                     filters_3x3_reduce,
                     filters_3x3,
                     filters_5x5_reduce,
                     filters_5x5,
                     filters_pool_proj,
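                     name=None):
    """Build one Inception v1 module: four parallel branches concatenated along the channel axis.

    Note: `kernel_init` and `bias_init` are module-level globals and must be
    defined before this function is called.
    """
    conv_1x1 = Conv2D(filters_1x1, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)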
    
    conv_3x3 = Conv2D(filters_3x3_reduce, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_3x3 = Conv2D(filters_3x3, (3, 3), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_3x3)

    conv_5x5 = Conv2D(filters_5x5_reduce, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_5x5 = Conv2D(filters_5x5, (5, 5), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(conv_5x5)

    pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(pool_proj)

    output = concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=3, name=name)
    
    return output
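Since the four branches all use `padding='same'` with stride 1 and are concatenated along the channel axis, the number of output channels of a module is simply the sum of the last filter count in each branch. A minimal sanity check, using a hypothetical helper `inception_output_channels` and the filter counts of the first two modules of the model defined below:

```python
def inception_output_channels(filters_1x1, filters_3x3, filters_5x5, filters_pool_proj):
    """Channels after concatenating the four branches (reduce layers do not count)."""
    return filters_1x1 + filters_3x3 + filters_5x5 + filters_pool_proj

channels_3a = inception_output_channels(64, 128, 32, 32)
channels_3b = inception_output_channels(128, 192, 96, 64)

print(channels_3a)  # 256
print(channels_3b)  # 480
```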

We will then create the GoogLeNet architecture, as mentioned in the paper.

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-16-11-56-41.png)

In [5]:
kernel_init = tf.keras.initializers.GlorotUniform()
bias_init = tf.keras.initializers.Constant(value=0.2)

In [6]:
# the model

input_layer = Input(shape=(32, 32, 3))
input_resize = Lambda(lambda image: tf.image.resize(image, (224, 224)))(input_layer)

x = Conv2D(64, (7, 7), padding='same', strides=(2, 2), activation='relu', name='conv_1_7x7/2', kernel_initializer=kernel_init, bias_initializer=bias_init)(input_resize)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_1_3x3/2')(x)
x = Conv2D(64, (1, 1), padding='same', strides=(1, 1), activation='relu', name='conv_2a_3x3/1')(x)
x = Conv2D(192, (3, 3), padding='same', strides=(1, 1), activation='relu', name='conv_2b_3x3/1')(x)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_2_3x3/2')(x)

x = inception_module(x,
                     filters_1x1=64,
                     filters_3x3_reduce=96,
                     filters_3x3=128,
                     filters_5x5_reduce=16,
                     filters_5x5=32,
                     filters_pool_proj=32,
                     name='inception_3a')

x = inception_module(x,
                     filters_1x1=128,
                     filters_3x3_reduce=128,
                     filters_3x3=192,
                     filters_5x5_reduce=32,
                     filters_5x5=96,
                     filters_pool_proj=64,
                     name='inception_3b')

x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_3_3x3/2')(x)

x = inception_module(x,
                     filters_1x1=192,
                     filters_3x3_reduce=96,
                     filters_3x3=208,
                     filters_5x5_reduce=16,
                     filters_5x5=48,
                     filters_pool_proj=64,
                     name='inception_4a')


x1 = AveragePooling2D((5, 5), strides=3)(x)
x1 = Conv2D(128, (1, 1), padding='same', activation='relu')(x1)
x1 = Flatten()(x1)
x1 = Dense(1024, activation='relu')(x1)
x1 = Dropout(0.7)(x1)
x1 = Dense(10, activation='softmax', name='auxilliary_output_1')(x1)

x = inception_module(x,
                     filters_1x1=160,
                     filters_3x3_reduce=112,
                     filters_3x3=224,
                     filters_5x5_reduce=24,
                     filters_5x5=64,
                     filters_pool_proj=64,
                     name='inception_4b')

x = inception_module(x,
                     filters_1x1=128,
                     filters_3x3_reduce=128,
                     filters_3x3=256,
                     filters_5x5_reduce=24,
                     filters_5x5=64,
                     filters_pool_proj=64,
                     name='inception_4c')

x = inception_module(x,
                     filters_1x1=112,
                     filters_3x3_reduce=144,
                     filters_3x3=288,
                     filters_5x5_reduce=32,
                     filters_5x5=64,
                     filters_pool_proj=64,
                     name='inception_4d')


x2 = AveragePooling2D((5, 5), strides=3)(x)
x2 = Conv2D(128, (1, 1), padding='same', activation='relu')(x2)
x2 = Flatten()(x2)
x2 = Dense(1024, activation='relu')(x2)
x2 = Dropout(0.7)(x2)
x2 = Dense(10, activation='softmax', name='auxilliary_output_2')(x2)

x = inception_module(x,
                     filters_1x1=256,
                     filters_3x3_reduce=160,
                     filters_3x3=320,
                     filters_5x5_reduce=32,
                     filters_5x5=128,
                     filters_pool_proj=128,
                     name='inception_4e')

x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_4_3x3/2')(x)

x = inception_module(x,
                     filters_1x1=256,
                     filters_3x3_reduce=160,
                     filters_3x3=320,
                     filters_5x5_reduce=32,
                     filters_5x5=128,
                     filters_pool_proj=128,
                     name='inception_5a')

x = inception_module(x,
                     filters_1x1=384,
                     filters_3x3_reduce=192,
                     filters_3x3=384,
                     filters_5x5_reduce=48,
                     filters_5x5=128,
                     filters_pool_proj=128,
                     name='inception_5b')

x = GlobalAveragePooling2D(name='avg_pool_5_3x3/1')(x)

x = Dropout(0.4)(x)

x = Dense(10, activation='softmax', name='output')(x)
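It can help to trace the spatial size of the feature maps through the model above. The sketch below relies on the standard output-size rules for `'same'` and `'valid'` padding; `same_pad_out` and `valid_pad_out` are hypothetical helpers, and only the stride-2 layers are listed because every other layer keeps the spatial size unchanged.

```python
import math

def same_pad_out(size, stride):
    """Output size of a 'same'-padded layer: ceil(size / stride)."""
    return math.ceil(size / stride)

def valid_pad_out(size, window, stride):
    """Output size of a 'valid'-padded layer: floor((size - window) / stride) + 1."""
    return (size - window) // stride + 1

stride2_layers = ['conv_1_7x7/2', 'max_pool_1_3x3/2', 'max_pool_2_3x3/2',
                  'max_pool_3_3x3/2', 'max_pool_4_3x3/2']

size = 224  # spatial size after the resize Lambda layer
sizes = {}
for layer_name in stride2_layers:
    size = same_pad_out(size, 2)
    sizes[layer_name] = size

print(list(sizes.values()))  # [112, 56, 28, 14, 7]

# Auxiliary heads: 5x5 average pooling with stride 3 and default 'valid' padding on a 14x14 map
aux_size = valid_pad_out(14, 5, 3)
print(aux_size)  # 4
```

The final $7\times7$ map is what the global average pooling layer collapses to $1\times1$.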

In [7]:
model = Model(input_layer, [x, x1, x2], name='inception_v1')

In [8]:
model.summary()

Model: "inception_v1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 224, 224, 3)  0                                            
__________________________________________________________________________________________________
conv_1_7x7/2 (Conv2D)           (None, 112, 112, 64) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
max_pool_1_3x3/2 (MaxPooling2D) (None, 56, 56, 64)   0           conv_1_7x7/2[0][0]               
__________________________________________________________________________________________________
conv_2a_3x3/1 (Conv2D)          (None, 56, 56, 64)   4160        max_pool_1_3x3/2[0][0]           
_______________________________________________________________________________________

The model looks fine, as you can gauge from the above output. 
We can add a few finishing touches before we train our model. We will define the following:
* Loss function for each output layer
* Weight assigned to the loss of each output layer
* Optimizer, together with a schedule that decays the learning rate every 8 epochs
* Evaluation metric

In [9]:
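epochs = 25
initial_lrate = 0.01

def decay(epoch, lr=None):
    # Step decay: multiply the learning rate by `drop` once every `epochs_drop` epochs.
    # Keras may pass the current learning rate as a second argument; it is unused here.
    drop = 0.96
    epochs_drop = 8
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate

# `learning_rate` replaces the deprecated `lr` argument
sgd = SGD(learning_rate=initial_lrate, momentum=0.9, nesterov=False)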

lr_sc = LearningRateScheduler(decay, verbose=1)

model.compile(loss=['categorical_crossentropy', 'categorical_crossentropy', 'categorical_crossentropy'], loss_weights=[1, 0.3, 0.3], optimizer=sgd, metrics=['accuracy'])
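The step schedule can be checked in isolation before training. The sketch below re-implements the same formula as a standalone function (named `step_decay` here to keep it separate from the notebook's `decay`) and lists the learning rate for each zero-based epoch index.

```python
import math

def step_decay(epoch, initial_lrate=0.01, drop=0.96, epochs_drop=8):
    """Learning rate for a zero-based epoch index under the step schedule."""
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

rates = [step_decay(epoch) for epoch in range(25)]

# Epoch indices 0-6 keep the initial rate; the first drop happens at index 7 (the 8th epoch)
for epoch in (0, 6, 7, 15, 23):
    print(epoch, round(rates[epoch], 6))
```

With these settings the rate only falls from 0.01 to about 0.0088 over 25 epochs, so the schedule is a gentle one.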

Our model is now ready to fit.

In [None]:
history = model.fit(X_train, [y_train, y_train, y_train], validation_data=(X_test, [y_test, y_test, y_test]), 
                    epochs=epochs, batch_size=256, callbacks=[lr_sc])

Train on 50000 samples, validate on 10000 samples
Epoch 1/25

Epoch 00001: LearningRateScheduler setting learning rate to 0.01.
Epoch 2/25

Epoch 00002: LearningRateScheduler setting learning rate to 0.01.
Epoch 3/25

Epoch 00003: LearningRateScheduler setting learning rate to 0.01.
Epoch 4/25

Epoch 00004: LearningRateScheduler setting learning rate to 0.01.
Epoch 5/25

Epoch 00005: LearningRateScheduler setting learning rate to 0.01.
Epoch 6/25

Epoch 00006: LearningRateScheduler setting learning rate to 0.01.
Epoch 7/25

Epoch 00007: LearningRateScheduler setting learning rate to 0.01.
Epoch 8/25

Epoch 00008: LearningRateScheduler setting learning rate to 0.0096.
Epoch 9/25

Epoch 00009: LearningRateScheduler setting learning rate to 0.0096.
Epoch 10/25

Epoch 00010: LearningRateScheduler setting learning rate to 0.0096.
Epoch 11/25

Epoch 00011: LearningRateScheduler setting learning rate to 0.0096.
Epoch 12/25

Epoch 00012: LearningRateScheduler setting learning rate to 0.0096.

## Inception-ResNet

Inspired by the performance of ResNet, a hybrid inception module was proposed. There are two sub-versions of Inception-ResNet, namely v1 and v2. The former has a computational cost similar to that of Inception v3, while the latter is comparable to Inception v4.

Here is a scheme of the inception block with skip connections.

![](https://miro.medium.com/max/1400/1*WyqyCKA4mP1jsl8H4eHrjg.jpeg)

## To go deeper 😉

One can read this excellent [post](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202).

## Bibliography

* [Inception paper](https://arxiv.org/pdf/1409.4842v1.pdf)
* [Inception v2 and v3 paper](https://arxiv.org/pdf/1512.00567v3.pdf)
* [Inception v4 and Inception-ResNet paper](https://arxiv.org/pdf/1602.07261.pdf)