### CNN

https://cs231n.github.io/ → from Stanford University

### DL

https://mohitjain.me/category/deep-learning/

---

### Visual Cortex

a biological inspiration

* Convolutional Neural Networks (CNNs, ConvNets) are an extremely popular form of neural network.

* CNNs have been specifically designed for visual tasks.
    - MNIST (handwritten digit recognition)
    - face detection
    - object recognition
        - machine learning
        - computer vision (intersection of image processing and ml/ai)

* **Why create CNNs instead of using an MLP?**
    - lots of image processing and computer vision techniques have been inspired by biological findings
        - there are some neurons that fire when presented with specific objects at specific angles
        - there are some neurons that detect 
            - edges
            - motion
            - depth
            - faces
            - shapes
    
    - **biological inspiration**
        <img src="https://www.frontiersin.org/files/Articles/34845/fpsyg-04-00124-HTML/image_m/fpsyg-04-00124-g001.jpg">
   
   <br>
   
    - **CNN**
        <img src="https://cdn-images-1.medium.com/fit/t/1600/480/1*vkQ0hXDaQv57sALXAJquxA.jpeg">

**Credits** - Images from Internet

### Convolution : Edge Detection

edge detection is one of the earliest operations in the primary visual cortex

* **Convolution**
    - given two matrices of the same size, convolution is a simple operation: element-wise multiplication followed by addition of the products
    - a generalization of the dot product
    
    <img src="https://miro.medium.com/max/728/1*Fr6Umze2waDjWVHB2yzT4A.png">

* **Edge Detection**
    - we use the `sobel` operator for detecting edges.
        - **horizontal**
            <pre>
            s_h = [[+1, +2, +1],
                   [ 0,  0,  0],
                   [-1, -2, -1]]
            </pre>
        
        - **vertical**
            <pre>
            s_v = [[+1,  0, -1],
                   [+2,  0, -2],
                   [+1,  0, -1]]
            </pre>
    
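    - let $A$ be the original image matrix
    - let $G_x = S_v * A$ (horizontal gradient, strongest at vertical edges) and $G_y = S_h * A$ (vertical gradient, strongest at horizontal edges), where $S_h$ and $S_v$ are the kernels above and $*$ denotes convolution
    - the gradient magnitude (the edge strength at each pixel) is $G = \sqrt{G_x^2 + G_y^2}$
    - we apply the above kernels to every (3 * 3) sub-matrix of the image; the resultant image contains the edges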
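
As a small worked example of the steps above, the sketch below applies both Sobel kernels to a single patch. The 3 x 3 `patch` is a hypothetical input (dark on the left, bright on the right), and the names `g_h`, `g_v` and `g` are chosen here for illustration. Strictly speaking, convolution flips the kernel first; like the code in these notes, the sketch skips the flip, which for Sobel kernels only changes the sign and leaves the magnitude unchanged.

```python
import numpy as np

# Horizontal Sobel kernel from the notes; its transpose is the vertical kernel.
s_h = np.array([[+1, +2, +1],
                [ 0,  0,  0],
                [-1, -2, -1]])
s_v = s_h.T

# Hypothetical patch containing a vertical edge (assumption for illustration).
patch = np.array([[0, 0, 255],
                  [0, 0, 255],
                  [0, 0, 255]])

# Element-wise multiplication followed by addition.
g_h = np.sum(patch * s_h)   # response of the horizontal kernel
g_v = np.sum(patch * s_v)   # response of the vertical kernel

# Gradient magnitude for this patch.
g = np.sqrt(g_h**2 + g_v**2)

print(g_h, g_v, g)
```

The horizontal kernel gives no response on this patch, while the vertical kernel responds strongly, which is what we expect for a vertical edge.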

In [1]:
# import cv2 as cv  # only needed for the commented-out image-loading cell
import numpy as np

from matplotlib import pyplot as plt

In [2]:
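def scale_image(image_mat):
    # linearly rescale the pixel values to the range [s_min, s_max]
    s_min, s_max = 0, 255
    image_min = image_mat.min()
    image_max = image_mat.max()
    if image_max == image_min:
        # constant image: avoid division by zero
        return np.full(image_mat.shape, s_min, dtype=float)
    scaled_image = s_min + ((image_mat - image_min) * (s_max - s_min)) / (image_max - image_min)
    return scaled_image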

In [3]:
def break_submatrices(image_mat, square_kernel_size=3):
    # zero-pad so that every pixel gets a full (k x k) neighbourhood (odd k assumed)
    k = square_kernel_size
    pw = k // 2
    image_pad = np.pad(array=image_mat, pad_width=pw)
    n_rows, n_cols = image_mat.shape

    # one (k x k) sub-matrix per pixel of the original image, in row-major order
    flat_submatrices = np.array([
        image_pad[i:(i + k), j:(j + k)]
        for i in range(n_rows) for j in range(n_cols)
    ])

    return flat_submatrices

In [4]:
def sobel_operation(A, kernel):
    return np.sum(np.multiply(A, kernel))

def gradient_approximation(s_h, s_v):
    return np.sqrt(s_h**2 + s_v**2)

In [5]:
def sobel_convolution(image_mat, scale=True, show_plot=True):
    orig_shape = image_mat.shape
    
    # horizontal Sobel kernel; its transpose is the vertical kernel
    mask = np.array([
        [+1, +2, +1],
        [ 0,  0,  0],
        [-1, -2, -1]
    ])
    k_h = mask
    k_v = mask.T
    
    flat_submatrices = break_submatrices(image_mat=image_mat)
    
    t_mat = []
    for each_mat in flat_submatrices:
        s_h = sobel_operation(A=each_mat, kernel=k_h)
        s_v = sobel_operation(A=each_mat, kernel=k_v)
        s_a = gradient_approximation(s_h=s_h, s_v=s_v)
        # s_a = s_h + s_v
        t_mat.append(s_a)
    
    eimage_mat = np.array(t_mat).reshape(orig_shape)
    if scale:
        eimage_mat = scale_image(image_mat=eimage_mat)
    
    if show_plot:
        cmap_val = 'gray'
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 20))

        ax1.axis("off")
        ax1.title.set_text('Original')

        ax2.axis("off")
        ax2.title.set_text("Sobel")

        ax1.imshow(image_mat, cmap=cmap_val)
        ax2.imshow(eimage_mat, cmap=cmap_val)
        plt.show()
    
    return eimage_mat

In [6]:
# image_mat = cv.imread(filename='valve_original.png')
# image_mat = cv.cvtColor(src=image_mat, code=cv.COLOR_BGR2GRAY)
# eimage_mat = sobel_convolution(image_mat=image_mat, scale=True, show_plot=True)

![valve_ed](https://user-images.githubusercontent.com/63333753/140903158-6953525e-bca9-46af-9427-85eebb475e60.png)

### Padding & Strides

* **Padding** → to retain the original size of the image after convolution, we pad the image with `0` values around its border.

    - before padding
        - $n \ \text{x} \ n \implies \text{(conv with kernel)} \implies (n - k + 1) \ \text{x} \ (n - k + 1)$
            - where $k$ is the kernel size
        - reduces the size of the matrix after convolution
    - after padding
        - $n \ \text{x} \ n \implies \text{(conv with kernel + padding)} \implies (n - k + 2p + 1) \ \text{x} \ (n - k + 2p + 1)$
            - where $p$ is the padding width
            
        - retains the size of the matrix after convolution when $p = (k - 1) / 2$ (for odd $k$)

* **Stride** → the number of pixels by which the kernel is shifted over the image matrix at each step is called the stride.
    - a stride greater than 1 reduces the size of the output
    
    - before striding
        - $n \ \text{x} \ n \implies \text{(conv with kernel)} \implies (n - k + 1) \ \text{x} \ (n - k + 1)$
    - after striding
        - $n \ \text{x} \ n \implies \text{(conv with kernel + striding)} \implies \bigg[ \text{floor}\big(\frac{n - k}{s}\big) + 1 \bigg] \ \text{x} \ \bigg[ \text{floor}\big(\frac{n - k}{s}\big) + 1 \bigg]$
            - where $s$ is the stride

* Formulation in the case where both padding and striding are used -

$$n \ \text{x} \ n \implies \text{(conv with kernel + padding + striding)} \implies \bigg[ \text{floor}\bigg(\frac{n - k + 2p}{s}\bigg) + 1 \bigg] \ \text{x} \ \bigg[ \text{floor}\bigg(\frac{n - k + 2p}{s}\bigg) + 1 \bigg]$$
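
The formula above can be checked with a few lines of Python. This is a minimal sketch; the function name `conv_output_size` and the example sizes are chosen here for illustration and are not part of the original notes.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output side length for an n x n input: floor((n - k + 2p) / s) + 1."""
    if k > n + 2 * p:
        raise ValueError("kernel is larger than the padded input")
    return (n - k + 2 * p) // s + 1


# no padding, stride 1: the output shrinks to n - k + 1
print(conv_output_size(28, 5))        # 24
# padding p = (k - 1) / 2 keeps the size
print(conv_output_size(28, 5, p=2))   # 28
# stride 2 roughly halves the size
print(conv_output_size(28, 2, s=2))   # 14
```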

### Convolution : RGB Images

* A typical color image is made up of pixels, each represented by three values: red, green, and blue (RGB).

* Each of the R, G, and B values of a pixel is a number in the range 0 to 255.
    - R, G, and B are referred to as channels
    - **R → Red → 0 to 255**
    - **G → Green → 0 to 255**
    - **B → Blue → 0 to 255**
    
    <img src="https://raw.githubusercontent.com/msameeruddin/Data-Analysis-Python/main/8_DA_Image_da/rgb_demo.png">

<br>

* Some important colors and their RGB values -

| Pixel | R | G | B |
| --- | --- | --- | --- |
| **White** | 255 | 255 | 255 |
| **Red** | 255 | 0 | 0 |
| **Green** | 0 | 255 | 0 |
| **Blue** | 0 | 0 | 255 |
| **Black** | 0 | 0 | 0 |
| **Yellow** | 255 | 255 | 0 |

* All colors → https://www.colorhexa.com/color-names.

* A color image can be represented as a 3D `tensor` of size (height * width * channels).

* **Convolution** on a color image (`tensor`) is similar to convolution on grayscale images.
    - the kernel also becomes a `tensor` in this case, with the same number of channels as the image
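
A minimal NumPy sketch of this idea follows. It assumes no padding and a stride of 1; the function name `convolve_rgb` and the all-ones toy image and kernel are illustrative choices, not from the original notes. The product is summed over all channels, so one 3D kernel produces one 2D output.

```python
import numpy as np


def convolve_rgb(image, kernel):
    """Slide a (k, k, c) kernel over an (h, w, c) image (no padding, stride 1)."""
    h, w, _ = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply across all channels and add everything into one number
            out[i, j] = np.sum(image[i:i + k, j:j + k, :] * kernel)
    return out


image = np.ones((5, 5, 3))    # toy 5 x 5 RGB image
kernel = np.ones((3, 3, 3))   # toy 3 x 3 kernel with 3 channels
result = convolve_rgb(image, kernel)
print(result.shape)           # (3, 3)
```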

### Convolution Layer

convolution with multiple kernels followed by ReLU

* In classical image processing, we have kernels to perform various operations on images.

* In a typical MLP, we try to learn the weights by backpropagation.

* We know that a visual cortex consists of various layers to detect various things.

* In a CNN, we try to learn kernels (for detecting various things).
    - this is similar to learning the weights in an MLP
    - kernels can be thought of as weight matrices

* A convolution layer has 
    - an input (3D tensor) of size (n * n * c), where c is the number of channels (3 for an RGB image)
    - multiple kernels, all of the same size (k * k * c)
    - an output 3D tensor of size (n * n * m), assuming padding that preserves the spatial size
        - m corresponds to the number of kernels

* **How many kernels to take?**
    - it is a hyperparameter which needs to be tuned

* **Process**
    - we have a 3D tensor (RGB image) as the input
    - the input is convolved with each of the m kernels (each of size k * k * 3)
    - then we apply the ReLU activation function element-wise
    - the resultant tensor is of size (n * n * m)
    
    <img src="https://www.researchgate.net/profile/Devis-Tuia/publication/323273293/figure/fig1/AS:614258178002946@1523461970440/Schematic-of-the-first-convolutional-layer-of-a-CNN-This-layer-learns-Nf-5-filters-of.png">

**Credits** - Image from Internet
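
The process described in this section can be sketched in NumPy as follows. This is a simplified illustration, not how a library implements it: it assumes odd kernel sizes, stride 1, zero padding that keeps the spatial size, and no bias term. The name `conv_layer` and the random toy inputs are hypothetical.

```python
import numpy as np


def conv_layer(image, kernels):
    """image: (n, n, c); kernels: (m, k, k, c) with odd k. Returns (n, n, m)."""
    n_h, n_w, _ = image.shape
    m, k = kernels.shape[0], kernels.shape[1]
    p = k // 2  # padding that keeps the spatial size for odd k and stride 1
    padded = np.pad(image, ((p, p), (p, p), (0, 0)))
    out = np.zeros((n_h, n_w, m))
    for f in range(m):            # one output channel per kernel
        for i in range(n_h):
            for j in range(n_w):
                out[i, j, f] = np.sum(padded[i:i + k, j:j + k, :] * kernels[f])
    return np.maximum(out, 0)     # element-wise ReLU


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                 # toy RGB input
kernels = rng.standard_normal((4, 3, 3, 3))   # m = 4 kernels of size 3 x 3 x 3
feature_maps = conv_layer(image, kernels)
print(feature_maps.shape)                     # (8, 8, 4)
```

In a real CNN the kernel values are not random; they are the weights learned by backpropagation.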

### Max-Pooling

popular method in modern ConvNets

* Pooling is a type of layer that can be added to the network.

* **Inspiration**
    - human beings can easily detect faces in an image
    - they can detect faces at any location in the image → **location invariant**
    - similarly, humans can easily detect a face even though it is rotated (**rotation invariant**) or its size varies (**scale invariant**)
    - to build some of this invariance into a neural network, we can use `max pooling`, which makes the output robust to small shifts in the input

* Max-pooling slides a window over the input much like a kernel does, but it has no learnable weights.

* The aim of the max-pooling operation is to keep only the maximum pixel value within each window of the input. The window is moved across the entire image using a stride.

<img src="https://www.researchgate.net/publication/333593451/figure/fig2/AS:765890261966848@1559613876098/Illustration-of-Max-Pooling-and-Average-Pooling-Figure-2-above-shows-an-example-of-max.png">

* Max-pooling is an operation that is typically applied after (convolution + ReLU); it reduces the spatial size of the feature maps.

**Credits** - Image from Internet
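
A minimal NumPy sketch of max-pooling on a single 2D feature map is shown below. The function name `max_pool` and the toy input are illustrative; a window of size 2 with stride 2 is assumed as the default because it is a common choice.

```python
import numpy as np


def max_pool(feature_map, size=2, stride=2):
    """Keep only the maximum value inside each (size x size) window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = feature_map[r:r + size, c:c + size].max()
    return out


x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool(x)
print(pooled)
# [[6 8]
#  [3 4]]
```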

> ### Convolution + ReLU + Max-Pooling → CNN

### CNN Training : Optimization

https://medium.com/@2017csm1006/forward-and-backpropagation-in-convolutional-neural-network-4dfa96d7b37e

https://becominghuman.ai/back-propagation-in-convolutional-neural-networks-intuition-and-code-714ef1c38199

* In MLP, with the help of backpropagation, we can learn the appropriate weights and for optimization, we can use 
    - SGD
    - AdaGrad
    - Adam
    - ...

* Similarly, as long as the convolution and max-pooling layers are differentiable, so that the loss is differentiable with respect to the weights, we can train and optimize the CNN.

    - the convolution layer is differentiable, as it is very similar to a layer in an MLP (multiplications followed by additions)
    - ReLU is also differentiable everywhere except at 0, where a derivative of 0 or 1 is used by convention
    - in max-pooling, the derivative is 1 for the max value and 0 for every non-max value
        - in backpropagation, we only pass the derivative back to the maximum value
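
The routing of the gradient through a max-pooling window can be sketched as below. This is an illustrative helper (the name `max_pool_backward` is hypothetical) for a single window; when several entries tie for the maximum, this sketch sends the gradient to the first one, which is one possible convention.

```python
import numpy as np


def max_pool_backward(window, upstream_grad):
    """Gradient of max(window) with respect to each entry of the window."""
    grad = np.zeros_like(window, dtype=float)
    # only the position of the maximum receives the upstream gradient
    max_idx = np.unravel_index(np.argmax(window), window.shape)
    grad[max_idx] = upstream_grad
    return grad


window = np.array([[1.0, 3.0],
                   [5.0, 6.0]])
grad = max_pool_backward(window, upstream_grad=2.0)
print(grad)
# [[0. 0.]
#  [0. 2.]]
```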

### LeNet

https://www.kaggle.com/blurredmachine/lenet-architecture-a-complete-guide

<img src="https://miro.medium.com/max/2625/1*1TI1aGBZ4dybR6__DI9dzA.png">

**Credits** - Image from Internet

### Data Augmentation

* Augmentation means adding new, modified copies of the existing data.

* To make sure CNN models are good at recognizing or understanding visual data, the original data is augmented (for example by changing its structure or adding some noise).
    - flipping
    - mirroring
    - shifting
    - rotation
    - zooming (in and out)
    - shearing (stretching)
    - blurring
    - noising
    - we can combine a lot of operations and come up with additional data
    
    <img src="https://www.kdnuggets.com/wp-content/uploads/cats-data-augmentation.jpg">

* **Why is data augmentation useful?**
    - by data augmentation, the model can learn various types of `invariances` (similar to what we get from max-pooling)
    - the output doesn't change even if the input changes slightly
    - if we have a small dataset, we can get a much larger dataset

* Data augmentation is a popular technique especially in the field of computer vision.

**Credits** - Image from Internet
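
A few of the operations listed above can be sketched with plain NumPy. This is a minimal illustration on a toy grayscale array; the helper name `augment`, the shift of 2 pixels and the noise level are arbitrary assumptions. In practice a library such as Keras is normally used for augmentation.

```python
import numpy as np


def augment(image, rng):
    """Return a few augmented variants of a 2D grayscale image."""
    return {
        "flip_horizontal": np.fliplr(image),   # mirroring
        "flip_vertical": np.flipud(image),
        "rotate_90": np.rot90(image),          # 90 degrees counter-clockwise
        # shift right by 2 pixels; np.roll wraps pixels around the border
        "shift_right": np.roll(image, shift=2, axis=1),
        # additive Gaussian noise, clipped back to the valid pixel range
        "noise": np.clip(image + rng.normal(0, 10, image.shape), 0, 255),
    }


rng = np.random.default_rng(0)
image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4 x 4 "image"
variants = augment(image, rng)
print(list(variants))
```

Each original image yields several new training images, which is how a small dataset grows into a larger one.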

### Keras (CNN) Layers

* Convolution layers → https://keras.io/api/layers/convolution_layers/

* Pooling layers → https://keras.io/api/layers/pooling_layers/

* Keras core layers → https://keras.io/api/layers/core_layers/

* LeNet code architecture → https://github.com/DustinAlandzes/mnist-lenet-keras/blob/master/lenet.py

```python
# imports updated for the Keras 2+ API (Conv2D replaces Convolution2D)
from keras.models import Sequential
from keras.layers import Input, Conv2D, MaxPooling2D
from keras.layers import Activation, Flatten, Dense

class LeNet:
    @staticmethod
    def build(width, height, depth, classes, weightsPath=None):
        # initialize the model; the input shape is channels-last
        # (height, width, depth), the Keras default data format
        model = Sequential()
        model.add(Input(shape=(height, width, depth)))

        # first set of CONV => RELU => POOL
        model.add(Conv2D(20, (5, 5), padding="same"))
        model.add(Activation("relu"))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # second set of CONV => RELU => POOL
        model.add(Conv2D(50, (5, 5), padding="same"))
        model.add(Activation("relu"))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # set of FC => RELU layers
        model.add(Flatten())
        model.add(Dense(500))
        model.add(Activation("relu"))

        # softmax classifier
        model.add(Dense(classes))
        model.add(Activation("softmax"))

        # if weightsPath is specified load the weights
        if weightsPath is not None:
            model.load_weights(weightsPath)

        return model
```

### AlexNet

http://euler.stat.yale.edu/~tba3/stat665/lectures/lec18/notebook18.html

https://cs231n.github.io/convolutional-networks/

* AlexNet architecture

    <img src="https://i0.wp.com/ramok.tech/wp-content/uploads/2017/12/2017-12-31_01h31_40.jpg">

* Research paper → https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

**Credits** - Image from Internet

### VGGNet

https://arxiv.org/pdf/1409.1556.pdf

* A network with a simpler, more uniform design than AlexNet, built by **Karen Simonyan** and **Andrew Zisserman** (Visual Geometry Group, University of Oxford).

* Two types of network
    - VGG-16 → 16 weight layers
    - VGG-19 → 19 weight layers

* Whereas AlexNet uses kernels of different sizes, VGGNet simplifies the design by using the same (3 * 3) kernel with stride 1 and `same` padding for every convolution, and max-pooling of size (2 * 2) with stride 2.

* VGGNet is still widely used as a baseline and as a pre-trained feature extractor for many computer vision tasks.

* **Regularisation in VGG** : The training is regularised by weight decay (the L2 penalty multiplier set to $5 \times 10^{-4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).

* Architecture
    - **workflow**
    
    <img src="https://qphs.fs.quoracdn.net/main-qimg-83c7dee9e8b039c3ca27c8dd91cacbb4">
    
    - **visual workflow**
    
    <img src="https://qphs.fs.quoracdn.net/main-qimg-ba81c87204be1a5d11d64a464bca39eb">

**Credits** - Images from Internet
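
To see what the uniform design means for feature-map sizes, the sketch below applies the output-size formula to the VGG settings described above. It assumes a 224 x 224 input and five convolution blocks that each end in one max-pooling layer, as in the VGG paper; the helper name `output_size` is chosen here for illustration.

```python
def output_size(n, k, p, s):
    """floor((n - k + 2p) / s) + 1 for an n x n input."""
    return (n - k + 2 * p) // s + 1


n = 224      # assumed input side length
sizes = []
for _ in range(5):   # assumed: five blocks, each ending in one max-pool
    # (3 * 3) convolution, stride 1, padding 1: the size is unchanged,
    # so one call stands in for all convolutions of the block
    n = output_size(n, k=3, p=1, s=1)
    # (2 * 2) max-pooling, stride 2: the size is halved
    n = output_size(n, k=2, p=0, s=2)
    sizes.append(n)

print(sizes)   # [112, 56, 28, 14, 7]
```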

### ResNet (Residual Network)

https://arxiv.org/pdf/1512.03385.pdf

* The problem with `plain networks` (like VGGNet) is that, as the number of layers increases, both the training error and the test error (even after many iterations) become worse compared with a shallower network.
    - this happens even when using `dropout`, `ReLU`, etc.

* The building block to solve the above problem is below.

![resnet-building-block](https://user-images.githubusercontent.com/63333753/141275207-166a9437-4a4a-42ae-b144-8a73a34e8913.png)

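* We use the idea of a **skip connection**: the input $x$ of the block is added to the output $F(x)$ of its layers, so the block computes $\text{ReLU}(F(x) + x)$. This solves the problem and gives much better results for very deep networks.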

> $\text{ReLU}\big[\ \text{ReLU} \ (x)\ \big] = \text{ReLU} \ (x)$

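* **Key Takeaways**

    - adding new layers does not hurt performance: if regularization drives their weights to zero, then $F(x) = 0$ and the block outputs $\text{ReLU}(x) = x$ (since $x$ is itself the output of a ReLU), so the layers are effectively skipped
    - if the new layers are useful even in the presence of regularization, their weights or kernels will be non-zero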
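
The first takeaway can be checked numerically. The sketch below is a simplified residual block that uses dense weight matrices instead of convolutions (an assumption made to keep it short; the actual ResNet block uses convolution layers). The names `residual_block`, `w1` and `w2` are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0)


def residual_block(x, w1, w2):
    """Simplified block: ReLU(F(x) + x) with F(x) = w2 @ ReLU(w1 @ x)."""
    f = w2 @ relu(w1 @ x)
    return relu(f + x)   # skip connection adds the input back


# x is the output of a previous ReLU, so it is non-negative
x = relu(np.array([1.5, -2.0, 0.5]))

# weights driven to zero: the block reduces to the identity
w_zero = np.zeros((3, 3))
out = residual_block(x, w_zero, w_zero)
print(np.array_equal(out, x))   # True
```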

### Inception Network

https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202

### Transfer Learning

https://cs231n.github.io/transfer-learning/

* **Task** : Let's say that the task here is to classify dogs from cats. Of course, for solving this we can build a ConvNet (a 2-class classifier).

* **Idea** : Instead of building a convolutional neural network from scratch, what if we reuse an existing model (such as VGG-16) which is trained on a different dataset?
    - Can the **learning** be **transferred** from one dataset to a different dataset to solve a different task altogether?
    
    <img src="https://datascience.aero/wp-content/uploads/2020/03/transferlearning-119.jpg">

> Keras has the complete model (VGG-16) trained on ImageNet dataset - (pre-trained) models.

* **Case - 1**
    - we can use the bottleneck features (features that come before the flattening layer)
        - we can train a simple linear classifier on these bottleneck features
    - we use the existing model as a feature engineering tool
    - use this technique when the new dataset is very small and very similar to ImageNet

* **Case - 2**
    - we can fine-tune the last layers of the neural network (VGG-16) using the new dataset and freeze the earlier layers
        - while fine-tuning the model, we should keep the learning rate small so that the weights do not change too drastically
    - use this technique when the new dataset is medium-sized and very similar to ImageNet

* **Case - 3**
    - use the existing model (VGG-16) as an initialization and fine-tune the whole model for the task
        - again, the learning rate should be small
    - use this technique when the new dataset is large and very similar to ImageNet

* **Case - 4**
    - discarding the existing model and building a new model from scratch
        - **NOT RECOMMENDED AT ALL**
    - when the new dataset is small and not similar to ImageNet, use the output of the earlier layers as features for the new task instead


**Credits** - Image from Internet
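
The idea behind Case 1 can be sketched without any deep learning library. In the code below, a fixed random layer stands in for the frozen pre-trained network (these are NOT real VGG-16 weights), the dataset is a hypothetical toy problem, and only a small logistic-regression head is trained on the resulting features. All names (`w_frozen`, `bottleneck_features`, `w_head`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen layers of a pre-trained model (random, not VGG-16).
w_frozen = rng.standard_normal((8, 4))
w_frozen_before = w_frozen.copy()


def bottleneck_features(x):
    """Frozen feature extractor: linear layer followed by ReLU."""
    return np.maximum(x @ w_frozen.T, 0)


# Hypothetical toy two-class dataset.
x = rng.standard_normal((200, 4))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

# Features are computed once; the frozen weights are never updated.
features = bottleneck_features(x)


def loss_and_grads(w, b):
    """Binary cross-entropy of a logistic-regression head and its gradients."""
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss, features.T @ (p - y) / len(y), np.mean(p - y)


# Train only the new linear classifier, with a small learning rate.
w_head, b_head, lr = np.zeros(8), 0.0, 0.05
initial_loss, _, _ = loss_and_grads(w_head, b_head)
for _ in range(200):
    _, grad_w, grad_b = loss_and_grads(w_head, b_head)
    w_head -= lr * grad_w
    b_head -= lr * grad_b
final_loss, _, _ = loss_and_grads(w_head, b_head)

print(initial_loss, final_loss)
```

Case 2 and Case 3 differ only in how many of the pre-trained layers are also updated during training.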

### Cats VS Dogs

https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html