# Lecture Notes: The Inception Network Architecture (GoogLeNet)

## I. Introduction and Context

*   The Inception Network architecture was developed by **Google** researchers.
*   The research paper, "Going Deeper with Convolutions," was released on September 17, 2014, predating ResNets (2015).
*   A major focus of the paper was **making better use of the computing resources inside the network**: depth and width were increased while the computational budget was kept roughly constant.
*   The network is also commonly referred to as **GoogLeNet**. The capital "L" in GoogLeNet is a tribute to **LeNet**, Yann LeCun's pioneering CNN (LeNet-5, 1998).
*   The architecture is known for being **very deep and wide**.

## II. Problems with Traditional CNNs (Plain Networks)

Inception Networks were designed to solve several drawbacks observed in traditional Convolutional Neural Networks (referred to as Plain Networks):

1.  **Overfitting:** Having **more layers** results in more kernels and thus **more trainable parameters**. If the available dataset is limited, very deep architectures have a high chance of **overfitting**.
2.  **Computational Cost:** A high number of layers and kernels requires extensive computations, leading to an increase in **computational and GPU cost**.
3.  **Fixed Kernel Size:** Traditional CNNs use a **fixed kernel size** (e.g., $3 \times 3$ or $5 \times 5$) within any single convolutional layer, so each layer extracts features at only one scale before passing its output to the next layer.
4.  **Single Path Network:** Information flows in a single, straight path from one layer to the next.
5.  **Vanishing Gradients:** In deep architectures lacking residual connections, there is a high risk of **vanishing gradients**.

## III. The Inception Module: Core Innovation

The Inception Module addresses the issue of fixed kernel size and the limits of a single-path network:

*   **Multi-Kernel Approach:** The fundamental idea is to ask: **"Why use only one kind of convolution?"**
*   **Parallel Computation:** The Inception Module applies **multiple, different kernel types** to the same input simultaneously (at a single stage). These typically include **$1 \times 1$ convolutions, $3 \times 3$ convolutions, $5 \times 5$ convolutions, and Max Pooling**.
*   **Information Capture:** Since each parallel path looks at a different neighbourhood size and can learn different features, the combined output **captures information at multiple spatial scales**. This creates a **multi-way path** through the network instead of a single one.
*   **Concatenation:** The outputs from all parallel convolutional and pooling operations are **concatenated**.
*   **Shape Requirement:** All parallel outputs (e.g., output 1, 2, 3, 4) must have the **same height and width** to allow concatenation along the channel axis; only their channel counts may differ. This is typically achieved by using `padding='same'` with a stride of 1 in every branch.
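
The shape requirement can be checked with simple arithmetic. The sketch below is a minimal illustration in plain Python, assuming the usual `'same'`-padding rule (output size = ceil(input / stride)). The helper names are hypothetical, and the example branch widths (64, 128, 32 and 32 filters on a 28×28 input) mirror the first `inception_layer` call later in these notes.

```python
import math

def same_padding_size(size, stride=1):
    """Spatial output size of a conv/pool layer that uses 'same' padding."""
    return math.ceil(size / stride)

def inception_output_shape(height, width, branch_filters, stride=1):
    """Shape after concatenating parallel branches along the channel axis.

    Every branch uses 'same' padding and the same stride, so all branches
    share one spatial size; only the channel counts differ, and they add up.
    """
    out_h = same_padding_size(height, stride)
    out_w = same_padding_size(width, stride)
    return (out_h, out_w, sum(branch_filters))

# Branch outputs: 1x1 conv, 3x3 conv, 5x5 conv, pool projection
shape = inception_output_shape(28, 28, [64, 128, 32, 32])
print(shape)  # (28, 28, 256)
```

If any branch used a different stride or `'valid'` padding, its height and width would no longer match the others and the concatenation would fail.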

## IV. Optimization Using $1 \times 1$ Convolutions

While the multi-kernel approach is effective, a naive implementation is **computationally expensive**. To address this, the GoogLeNet authors added a key optimization:

*   **The Problem:** Calculating the output from a single layer (e.g., $5 \times 5$ convolution) can require a huge number of multiplications (e.g., 120 million).
*   **The Solution:** The Inception Module introduces a **$1 \times 1$ convolution** layer immediately **before** the larger $3 \times 3$ and $5 \times 5$ convolutions.
*   **Dimensionality Reduction:** The $1 \times 1$ convolution acts as a **bottleneck**, drastically **reducing the depth/size** of the input layer before it hits the expensive, larger kernels.
*   **Computational Savings:** This simple architectural trick reduces the number of multiplications by roughly a **factor of ten** (e.g., from 120 million to roughly 12 million multiplications in the given example).
*   **Efficiency Goal:** The aim is to achieve comparable or better performance with **fewer parameters** and less training time, making the model more practical for deployment.
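
The 120 million versus 12 million figures can be reproduced by counting multiplications. The notes do not state the layer sizes behind the example, so the sketch below assumes a commonly used setup that yields those numbers: a 28×28×192 input, 32 output filters for the $5 \times 5$ convolution, and a $1 \times 1$ bottleneck with 16 filters. The function name and constants are illustrative.

```python
def conv_multiplications(out_h, out_w, kernel, in_channels, filters):
    """Multiplications for a stride-1, 'same'-padded convolution."""
    return out_h * out_w * filters * kernel * kernel * in_channels

# Assumed example dimensions (not given in the notes)
H, W, C_IN = 28, 28, 192
FILTERS_5X5, REDUCE = 32, 16

# Direct 5x5 convolution on the full 192-channel input
direct = conv_multiplications(H, W, 5, C_IN, FILTERS_5X5)

# 1x1 bottleneck down to 16 channels, then the 5x5 convolution
bottleneck = (conv_multiplications(H, W, 1, C_IN, REDUCE)
              + conv_multiplications(H, W, 5, REDUCE, FILTERS_5X5))

# Weight counts (biases ignored) shrink by a similar factor
direct_weights = 5 * 5 * C_IN * FILTERS_5X5
bottleneck_weights = 1 * 1 * C_IN * REDUCE + 5 * 5 * REDUCE * FILTERS_5X5

print(direct)                          # 120422400
print(bottleneck)                      # 12443648
print(f"{direct / bottleneck:.1f}x")   # 9.7x
print(direct_weights, bottleneck_weights)  # 153600 15872
```

Under these assumptions the bottleneck cuts both the multiplications and the weights by close to a factor of ten, while the output shape (28×28×32) stays the same.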

## V. Auxiliary Networks (Deep Supervision)

In very deep architectures, waiting until the final output to calculate the loss can lead to unstable weight updates and slow convergence. Inception Nets introduced **Auxiliary Networks** to mitigate this:

*   **Intermediate Output:** An auxiliary network generates an intermediate prediction by taking the output of an intermediate Inception module (in GoogLeNet, modules 4a and 4d, i.e. the 3rd and 6th of the nine modules) and passing it through a small separate path: average pooling, a $1 \times 1$ convolution, fully connected layers with dropout, and a Softmax activation.
*   **Loss Calculation:** Each auxiliary prediction is compared with the same labels as the final prediction, which gives intermediate losses (L1, L2).
*   **Total Loss:** The overall loss for the network is a **weighted sum** of the final loss (L3) and the auxiliary losses (L1, L2), where the final loss receives the higher weight. In the original paper the auxiliary losses are scaled by 0.3.
*   **Deep Supervision:** During backpropagation, gradients from the auxiliary losses **enter the network at intermediate depths**, so the earlier layers receive a more direct error signal in addition to the gradient from the final output. This counteracts vanishing gradients and stabilises training.
*   **Inference Phase:** Auxiliary Networks are **only used during the training phase**. They are removed during the prediction or inference phase, where only the single, large, deep architecture is used.
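
The loss combination can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code used for GoogLeNet: the function names are hypothetical, the softmax outputs are made-up numbers for a single 3-class example, and the auxiliary weight of 0.3 is the value reported in the original paper.

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy loss of one softmax output for one labelled example."""
    return -math.log(probs[true_class])

def total_loss(main_loss, aux_losses, aux_weight=0.3):
    """Weighted sum: the final loss counts fully, auxiliary losses are discounted."""
    return main_loss + aux_weight * sum(aux_losses)

# Hypothetical softmax outputs for one example whose true class is 0
true_class = 0
aux1_probs = [0.50, 0.30, 0.20]   # first auxiliary head (shallow, less certain)
aux2_probs = [0.60, 0.25, 0.15]   # second auxiliary head
main_probs = [0.80, 0.15, 0.05]   # final classifier

l1 = cross_entropy(aux1_probs, true_class)
l2 = cross_entropy(aux2_probs, true_class)
l3 = cross_entropy(main_probs, true_class)

training_loss = total_loss(l3, [l1, l2])   # used only while training
inference_loss = l3                        # auxiliary heads are dropped at inference
print(training_loss, inference_loss)
```

In Keras, the same weighting is usually expressed through the `loss_weights` argument of `model.compile`, for example `loss_weights=[1.0, 0.3, 0.3]` for a model whose outputs are ordered main, auxiliary 1, auxiliary 2.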

## VI. Evolution of Inception Architecture (V1 to V4)

The Inception architecture underwent several key changes across its versions:

*   **Inception V1 (GoogLeNet):** Featured the multi-kernel parallel structure ($1 \times 1, 3 \times 3, 5 \times 5$, Max Pool) and used **Local Response Normalization (LRN)** layers.
*   **Inception V2:** Eliminated the $5 \times 5$ convolution, replacing it with **two consecutive $3 \times 3$ convolutions**. It also further factorised the $3 \times 3$ convolutions into asymmetric convolutions (e.g., $1 \times 3$ and $3 \times 1$) to reduce parameters even more. **Batch Normalization** replaced the LRN layers from this generation onward (the BN-Inception model, often labelled V2).
*   **Inception V3:** Built on V2 by adding Batch Normalization to the auxiliary classifier, label smoothing, and the RMSprop optimizer.
*   **Inception V4:** Published together with the Inception-ResNet models. Inception V4 itself is a deeper, more uniform pure-Inception design, while the Inception-ResNet variants **combine Inception modules with residual connections (ResNets)**.
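
The parameter savings from the V2 factorisations can be checked by counting weights. The sketch below is a rough illustration under stated assumptions: biases are ignored, every layer maps 64 channels to 64 channels (a hypothetical width chosen only for the arithmetic), and the helper names are made up for this example.

```python
def conv_weights(kernel_h, kernel_w, channels_in, channels_out):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * channels_in * channels_out

def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

C = 64  # hypothetical channel width

one_5x5 = conv_weights(5, 5, C, C)
two_3x3 = 2 * conv_weights(3, 3, C, C)
one_3x3 = conv_weights(3, 3, C, C)
asymmetric = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)

print(one_5x5, two_3x3)      # 102400 73728
print(one_3x3, asymmetric)   # 36864 24576

# Two stacked 3x3 layers see the same 5x5 neighbourhood as one 5x5 layer
print(stacked_receptive_field([3, 3]))  # 5
```

Two $3 \times 3$ layers use 18/25 of the weights of one $5 \times 5$ layer (a 28% saving), and the $1 \times 3$ plus $3 \times 1$ pair uses 6/9 of the weights of one $3 \times 3$ layer (a 33% saving), while covering the same neighbourhood.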


<img src="https://i.ibb.co/fYmWv3k7/image.png">
<img src="https://i.ibb.co/vvK75BLW/image.png">
<img src="https://i.ibb.co/6cnD5R0b/image.png">

In [1]:
import tensorflow as tf
from tensorflow.keras import layers, models,Sequential
from tensorflow.keras.layers import Rescaling
import numpy as np

In [16]:
def inception_layer(x, filter_1_x_1, filter_3_x_3_reduce, filter_3_x_3,
                    filter_5_x_5_reduce, filter_5_x_5, filters_pool_proj):
    """Inception module: four parallel branches concatenated along the channel axis."""
    # Branch 1: plain 1x1 convolution
    conv_1x1 = layers.Conv2D(filter_1_x_1, kernel_size=(1, 1), strides=(1, 1), padding='same', activation='relu')(x)

    # Branch 2: 1x1 bottleneck (channel reduction), then 3x3 convolution
    conv_3x3 = layers.Conv2D(filter_3_x_3_reduce, kernel_size=(1, 1), strides=(1, 1), padding='same', activation='relu')(x)
    conv_3x3 = layers.Conv2D(filter_3_x_3, kernel_size=(3, 3), strides=(1, 1), padding='same', activation='relu')(conv_3x3)

    # Branch 3: 1x1 bottleneck (channel reduction), then 5x5 convolution
    conv_5x5 = layers.Conv2D(filter_5_x_5_reduce, kernel_size=(1, 1), strides=(1, 1), padding='same', activation='relu')(x)
    conv_5x5 = layers.Conv2D(filter_5_x_5, kernel_size=(5, 5), strides=(1, 1), padding='same', activation='relu')(conv_5x5)

    # Branch 4: 3x3 max pooling (stride 1), then 1x1 projection to limit the channel count
    pool_proj = layers.MaxPool2D(pool_size=(3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = layers.Conv2D(filters_pool_proj, kernel_size=(1, 1), strides=(1, 1), padding='same', activation='relu')(pool_proj)

    # 'same' padding and stride 1 keep every branch at the input's height and width,
    # so the branches can be concatenated along the channel axis
    output = layers.concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=-1)
    return output

In [17]:
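def auxillary_layer(x):
    """Auxiliary classifier head, used only during training."""
    # 5x5 average pooling with stride 3 shrinks a 14x14 feature map to 4x4
    x = layers.AveragePooling2D(pool_size=(5, 5), strides=(3, 3), padding='valid')(x)
    # 1x1 convolution reduces the channel count to 128
    x = layers.Conv2D(128, kernel_size=(1, 1), strides=(1, 1), padding='same', activation='relu')(x)

    # Fully connected classifier with heavy dropout, ending in a 1000-class softmax
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.7)(x)
    x = layers.Dense(1000, activation='softmax')(x)

    return x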

In [20]:
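inputs = layers.Input(shape=(224, 224, 3))

# Stem: 224x224x3 -> 28x28x192.
# BatchNormalization stands in for the LRN layers of the original V1 design.
x = layers.Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding='same', activation='relu')(inputs)
x = layers.MaxPool2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
x = layers.BatchNormalization()(x)

# The paper also places a 1x1 convolution (64 filters) before this 3x3 layer; it is omitted here.
x = layers.Conv2D(192, kernel_size=(3, 3), strides=(1, 1), padding='same', activation='relu')(x)
x = layers.MaxPool2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
x = layers.BatchNormalization()(x)

# Inception 3a and 3b (28x28 feature maps)
x = inception_layer(x, 64, 96, 128, 16, 32, 32)
x = inception_layer(x, 128, 128, 192, 32, 96, 64)

x = layers.MaxPool2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)

# Inception 4a (14x14), followed by the first auxiliary classifier
x = inception_layer(x, 192, 96, 208, 16, 48, 64)
aux1 = auxillary_layer(x)

# Inception 4b, 4c, 4d, followed by the second auxiliary classifier
x = inception_layer(x, 160, 112, 224, 24, 64, 64)
x = inception_layer(x, 128, 128, 256, 24, 64, 64)
x = inception_layer(x, 112, 144, 288, 32, 64, 64)
aux2 = auxillary_layer(x)

# Inception 4e, then downsample to 7x7
x = inception_layer(x, 256, 160, 320, 32, 128, 128)
x = layers.MaxPool2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)

# Inception 5a and 5b (7x7 feature maps)
x = inception_layer(x, 256, 160, 320, 32, 128, 128)
x = inception_layer(x, 384, 192, 384, 48, 128, 128)

# Classifier head: 7x7 average pooling leaves a 1x1x1024 tensor, which is flattened
# so the final Dense layer outputs shape (batch, 1000), matching the auxiliary heads
x = layers.AveragePooling2D(pool_size=(7, 7), strides=(1, 1), padding='valid')(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.4)(x)
x = layers.Dense(1000, activation='softmax')(x)

# Three outputs during training: the main prediction plus two auxiliary predictions
model = models.Model(inputs, outputs=[x, aux1, aux2])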
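
The spatial sizes in the model cell above can be traced by hand before running `model.summary()`. The sketch below is a plain-Python sanity check, assuming the standard Keras rules for `'same'` padding (ceil(size / stride)) and `'valid'` padding (floor((size - kernel) / stride) + 1); the helper names are hypothetical.

```python
import math

def same(size, stride):
    """Output size for 'same' padding."""
    return math.ceil(size / stride)

def valid(size, kernel, stride):
    """Output size for 'valid' padding."""
    return (size - kernel) // stride + 1

size = 224
size = same(size, 2)    # 7x7 conv, stride 2        -> 112
size = same(size, 2)    # max pool, stride 2        -> 56
size = same(size, 1)    # 3x3 conv, stride 1        -> 56
size = same(size, 2)    # max pool, stride 2        -> 28 (Inception 3a, 3b)
stage3_size = size
size = same(size, 2)    # max pool, stride 2        -> 14 (Inception 4a to 4e)
stage4_size = size

# Auxiliary heads: 5x5 average pool, stride 3, 'valid' padding
aux_size = valid(stage4_size, 5, 3)        # 4
aux_flat = aux_size * aux_size * 128       # features entering Dense(1024)

size = same(size, 2)    # max pool, stride 2        -> 7 (Inception 5a, 5b)
stage5_size = size
final_size = valid(stage5_size, 7, 1)      # 7x7 average pool -> 1

# Channel count after the last Inception module (sum of its four branches)
final_channels = 384 + 384 + 128 + 128

print(stage3_size, stage4_size, stage5_size)  # 28 14 7
print(aux_size, aux_flat)                     # 4 2048
print(final_size, final_channels)             # 1 1024
```

The trace shows why the final average pool uses a 7×7 window: it collapses the last 7×7 feature map to a single 1024-value vector per image.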

In [21]:
model.summary()