# Introduction to Convnets

A convnet takes as input tensors of shape (image_height, image_width, image_channels). In the following example, we’ll configure the convnet to process inputs of size (28, 28, 1), which is the format of MNIST images. We’ll do this by passing the argument input_shape=(28, 28, 1) to the first layer.

In [1]:
from keras import layers 
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1))) 
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu')) 
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))



## Convolutional Neural Network

- In deep learning, a convolutional neural network (CNN) is a class of artificial neural network most commonly applied to analyze visual imagery.
- CNNs use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers.
- They are specifically designed to process pixel data and are used in image recognition and processing. 
- They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.

### The Convolution Operation

Convolution layers learn local patterns: in the case of images, patterns found in small 2D windows of the inputs.



![convolution](./convolution.png)

A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts (because the visual world is fundamentally spatially hierarchical).

![convolution](./spatial_hierarchy.png)

![convolution](./conv1-1-750x456.png)

`Output pixel value = (w1*px1 + w2*px2 + w3*px3 + w4*px4 + w5*px5 + w6*px6) + bias`
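
The weighted sum above can be sketched in plain NumPy. This is only an illustrative sketch, not how Keras implements `Conv2D`: it assumes a single-channel input, stride 1, and no padding, and the helper name `conv2d_single_channel` as well as the input and kernel values are made up for the example.

```python
import numpy as np

def conv2d_single_channel(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Weighted sum of the window, plus the bias: one output pixel.
            out[i, j] = np.sum(window * kernel) + bias
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical 4x4 input
kernel = np.ones((3, 3))                          # hypothetical 3x3 kernel
feature_map = conv2d_single_channel(image, kernel, bias=1.0)

print(feature_map.shape)   # (2, 2)
print(feature_map[0, 0])   # 46.0 (sum of the top-left 3x3 window is 45, plus bias 1)
```

A 3 × 3 kernel on a 4 × 4 input yields a 2 × 2 output, the same "size minus kernel size plus one" rule that turns 28 × 28 into 26 × 26 in the model above.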

![convolution](./conv2-1.gif)


- One important thing to note here is that convolutions occur per channel. A color input image generally consists of three channels: red, green, and blue.
- The example above shows the convolution operation on a 2D input, but an actual image is represented as a 3D tensor of shape (height, width, channels).
- Depth = "channels": the third dimension of the image tensor.
- Each filter therefore holds one 2D kernel per input channel. Each kernel slides over its own channel, and the per-channel results are then summed (together with the bias) to produce a single 2D output feature map.
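
As a rough sketch of the multi-channel case, the following NumPy code applies one filter to a three-channel input. It assumes the per-channel results are merged by summation (which is how standard convolution layers behave), stride 1, and no padding. The function name and the all-ones input are hypothetical.

```python
import numpy as np

def conv2d_multi_channel(image, kernels, bias=0.0):
    """image: (H, W, C); kernels: (kh, kw, C), one 2D kernel per input channel.

    Returns a single 2D feature map of shape (H - kh + 1, W - kw + 1).
    """
    kh, kw, channels = kernels.shape
    assert image.shape[2] == channels, "need one kernel per input channel"
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw, :]
            # Multiply each channel by its kernel, then sum over all channels.
            out[i, j] = np.sum(window * kernels) + bias
    return out

rgb = np.ones((5, 5, 3))      # hypothetical 5x5 RGB image
kernels = np.ones((3, 3, 3))  # one 3x3 kernel per channel
fmap = conv2d_multi_channel(rgb, kernels)

print(fmap.shape)   # (3, 3)
print(fmap[0, 0])   # 27.0 (3 * 3 * 3 ones summed together)
```

A layer with 32 filters repeats this with 32 different kernel stacks, which is why the output of the first `Conv2D` layer has 32 channels.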

### Activation 

- The activation function of a node defines the output of that node given an input or set of inputs. 
- A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. 
    - This is similar to the linear perceptron in neural networks. 
- However, only nonlinear activation functions allow such networks to compute nontrivial problems using only a small number of nodes, and such activation functions are called nonlinearities.
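
One way to see why nonlinearities matter: without them, stacked layers collapse into a single linear map. The sketch below uses random matrices as stand-ins for layer weights (biases omitted for brevity); all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hypothetical first-layer weights
W2 = rng.normal(size=(2, 4))   # hypothetical second-layer weights
x = rng.normal(size=3)         # hypothetical input vector

# Two layers without an activation in between...
two_linear_layers = W2 @ (W1 @ x)
# ...compute exactly what one layer with weights W2 @ W1 computes.
one_equivalent_layer = (W2 @ W1) @ x

print(np.allclose(two_linear_layers, one_equivalent_layer))  # True
```

Inserting a nonlinear function such as ReLU between the two matrix products breaks this equivalence, so depth can add expressive power.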

### Vanishing Gradient Problem

- Sigmoid functions are used frequently in neural networks to activate neurons. 
    - It is a logistic function with a characteristic S shape. 
    - The output value of the function is between 0 and 1. 
    - The sigmoid function is used for activating the output layers in binary classification problems. 
    - It is calculated as follows:

![sigmoid](./jacob-vanishing-gradient-1.png)

- First derivatives of sigmoid functions are bell curves with values ranging from 0 to 0.25: 

![firstderivative](./jacob-vanishing-gradient-2.jpg)
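
The 0 to 0.25 range of the derivative can be checked numerically. This sketch assumes the standard logistic sigmoid, 1 / (1 + e^(-x)), whose derivative is sigmoid(x) * (1 - sigmoid(x)). The function names are chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-10.0, 10.0, 2001)
grads = sigmoid_derivative(xs)

print(sigmoid_derivative(0.0))  # 0.25, the peak of the bell curve
print(grads.max() <= 0.25)      # True
print(grads.min() > 0.0)        # True: small in the tails, but never negative
```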

- In backpropagation, the new weight (w_new) of a node is calculated from the old weight (w_old) and the product of the learning rate (η) and the gradient of the loss function.


![gradient](./jacob-vanishing-gradient-5.jpg)

![new weight](./jacob-vanishing-gradient-6.jpg)
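
Written out, and assuming the figures show the standard gradient-descent update (the symbol L for the loss is a notation choice made here, not taken from the figures):

$$
w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial L}{\partial w_{\text{old}}}
$$

If the gradient term is close to zero, w_new is practically equal to w_old, whatever the learning rate.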

- With the chain rule of partial derivatives, we can represent the gradient of the loss function as a product of the gradients of the activation functions of the nodes with respect to their weights. 
    - Therefore, the updated weights of nodes in the network depend on the gradients of the activation functions of each node.

- For the nodes with sigmoid activation functions, we know that the partial derivative of the sigmoid function reaches a maximum value of 0.25. 
    - <b>As more layers are added to the network, the product of these derivatives keeps shrinking until the partial derivative of the loss function approaches zero and effectively vanishes.</b>
        - We call this the vanishing gradient problem.
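
A back-of-the-envelope sketch of the effect: if every layer contributes a sigmoid derivative of at most 0.25, the product over n layers is at most 0.25 to the power n. This deliberately ignores the weight terms that also appear in the chain rule, so treat it as a simplified upper bound on the activation factor only.

```python
max_sigmoid_grad = 0.25  # largest possible value of the sigmoid derivative

# Upper bound on the product of n sigmoid derivatives, for a few depths.
bounds = {n: max_sigmoid_grad ** n for n in (1, 5, 10, 20)}

for n, bound in bounds.items():
    print(f"{n:2d} layers: product of derivatives <= {bound:.3e}")
```

With 10 layers the bound is already below one millionth, which is why the updates to the weights can become negligible.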

- With shallow networks, the sigmoid function can be used because the small gradient values do not become an issue. 
- In deep neural networks, the vanishing gradient can have a significant impact on performance. 
    - During backpropagation, a neural network learns by updating its weights and biases to reduce the loss function. 
    - When the gradient vanishes, the weights remain practically unchanged, so the network cannot learn. 
    - The performance of the network suffers as a result.

### Method to Overcome the Problem

- Instead of sigmoid, use an activation function such as ReLU.
    - Rectified Linear Units (ReLU) are activation functions that generate a positive linear output when they are applied to positive input values. 
        - If the input is negative, the function will return zero.

![ReLU](./jacob-vanishing-gradient-7.jpg)

- The derivative of a ReLU function is defined as 1 for inputs that are greater than zero and 0 for inputs that are negative. 
    - The graph shared below indicates the derivative of a ReLU function.

![derivative of a ReLU function](./jacob-vanishing-gradient-8.png)

- Because the derivative of ReLU is either 0 or 1, the factors it contributes to the chain-rule product do not shrink the gradient the way repeated sigmoid factors of at most 0.25 do.
    - The use of the ReLU function thus helps prevent the gradient from vanishing.
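
A minimal NumPy sketch of ReLU and its derivative follows. The derivative at exactly 0 is mathematically undefined; taking it as 0 there is a convention assumed for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (0 at x == 0 by the convention assumed here).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(x))             # values: 0, 0, 0.5, 3
print(relu_derivative(x))  # values: 0, 0, 1, 1

# Along a path of 20 active units, the product of derivatives stays at 1.
active_path_product = np.prod(np.ones(20))
print(active_path_product)  # 1.0
```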


### The Max-Pooling Operation

Let’s display the architecture of the convnet so far. You can see that the output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels).

In [2]:
model.summary() 

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 3, 64)          36928     
                                                                 
Total params: 55,744
Trainable params: 55,744
Non-traina

- In the example, the size of the feature maps is halved after every MaxPooling2D layer. 
    - For instance, before the first MaxPooling2D layer, the feature map is 26 × 26, but the max-pooling operation halves it to 13 × 13. 
    - <b> That’s the role of max pooling: to aggressively downsample feature maps. </b>

- In short, the reason to use downsampling is to reduce the number of feature-map coefficients to process, as well as to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows.
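
The shapes and parameter counts in the summary can be reproduced with a little arithmetic. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling, which matches the layer arguments used in this notebook. The helper names are made up for the example.

```python
def conv_output_size(size, kernel_size):
    """Side length after a stride-1 convolution without padding."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size):
    """Side length after non-overlapping pooling (floor division)."""
    return size // pool_size

def conv_param_count(kernel_size, in_channels, filters):
    """kernel_size^2 * in_channels weights plus one bias, per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

# (layer kind, kernel or pool size, number of filters) for the model above
layers_spec = [("conv", 3, 32), ("pool", 2, None), ("conv", 3, 64),
               ("pool", 2, None), ("conv", 3, 64)]

size, channels = 28, 1
sizes, param_counts = [], []
for kind, k, filters in layers_spec:
    if kind == "conv":
        param_counts.append(conv_param_count(k, channels, filters))
        size, channels = conv_output_size(size, k), filters
    else:
        param_counts.append(0)  # pooling has no trainable parameters
        size = pool_output_size(size, k)
    sizes.append(size)

print(sizes)              # [26, 13, 11, 5, 3]
print(param_counts)       # [320, 0, 18496, 0, 36928]
print(sum(param_counts))  # 55744
```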

![max pooling](./max_pooling.png)

![average pooling](./average_pooling.png)
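
Both pooling variants can be sketched in NumPy for a single 2D feature map. The sketch assumes non-overlapping windows (stride equal to the pool size). The function name `pool2d` and the input values are illustrative.

```python
import numpy as np

def pool2d(feature_map, pool_size=2, mode="max"):
    """Non-overlapping pooling over a 2D feature map."""
    h, w = feature_map.shape
    out_h, out_w = h // pool_size, w // pool_size
    # Crop so both sides divide evenly, then split into pool_size x pool_size windows.
    cropped = feature_map[:out_h * pool_size, :out_w * pool_size]
    windows = cropped.reshape(out_h, pool_size, out_w, pool_size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])

max_pooled = pool2d(fmap, mode="max")   # values: [[6, 8], [3, 4]]
avg_pooled = pool2d(fmap, mode="avg")   # values: [[3.75, 5.25], [2, 2]]
print(max_pooled)
print(avg_pooled)
```

Either way the 4 × 4 map becomes 2 × 2: max pooling keeps the strongest response in each window, while average pooling blends the four values.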

So far, this is similar to what we have:

![convolution](./Basic-CNN-architecture-and-kernel-A-typical-CNN-consists-of-several-component-types.ppm.png)

- Adding a classifier on top of the convnet: 
    - First we have to flatten the 3D outputs to 1D, and then add a few Dense layers on top.
    - We'll do 10-way classification, using a final layer with 10 outputs and a softmax activation. 
        - In multiclass classification the softmax activation is often used.

        ![softmax](./Softmax-function-image.png)
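
For reference, a small NumPy sketch of the softmax function. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result. The logit values are made up.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # avoids overflow in exp for large logits
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)

print(probs)           # three positive values, the largest for class 0
print(probs.sum())     # sums to 1 (up to floating-point rounding)
print(probs.argmax())  # 0
```

The outputs are positive and sum to 1, so they can be read as class probabilities. With 10 output units, the model yields one probability per digit.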




In [None]:
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

In [4]:
model.summary()
# As you can see, the (3, 3, 64) outputs are flattened into vectors of shape (576,)
# before going through two Dense layers.
# A dense layer is fully connected to its preceding layer, which means each of its
# neurons is connected to every neuron of the preceding layer.

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 3, 64)          36928     
                                                                 
 flatten (Flatten)           (None, 576)               0

## Let's train the convnet on MNIST images

- A color image needs 3 scalars per pixel (one per channel), so a color dataset of this size would have shape 60000 (# images) × 28 × 28 × 3 (# channels).

- And how many channels do you need when the image is in greyscale? Just one, so the shape is 60000 × 28 × 28 × 1.

- Also, we’ll preprocess the data by reshaping it into the shape the network expects and scaling it so that all values are in the [0, 1] interval. 
    - Originally, our training images are stored in an array of shape (60000, 28, 28) of type uint8 with values in the [0, 255] interval. 
        - We transform it into a float32 array of shape (60000, 28, 28, 1) with values between 0 and 1.

- sparse_categorical_crossentropy: a loss function for multiclass classification models whose labels are integer class indices (rather than one-hot vectors).

![firstderivative](./1*Y2KPVGrVX9MQkeI8Yjy59Q.gif)
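
To make the "sparse" part concrete, here is a simplified NumPy sketch of the per-sample loss: the negative log of the probability the model assigns to the true class. It is not the Keras implementation (which, among other things, averages over the batch). The function name mirrors the loss name only for readability, and the labels and probabilities are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """labels: (N,) integer class ids; probs: (N, num_classes) predicted probabilities."""
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return -np.log(picked)

labels = np.array([0, 2])            # integer labels, not one-hot
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
losses = sparse_categorical_crossentropy(labels, probs)
print(losses)  # approximately [0.357, 0.223]

# Same result as ordinary categorical crossentropy on one-hot labels.
one_hot = np.eye(3)[labels]
dense_losses = -(one_hot * np.log(probs)).sum(axis=1)
print(np.allclose(losses, dense_losses))  # True
```

Since `mnist.load_data()` returns the labels as integers, the sparse variant can be used without one-hot encoding them first.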


In [9]:
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255
model.compile(optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd487f2c0a0>

In [6]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

Test accuracy: 0.991


### Understanding Border Effects and Padding



- Padding is the addition of (typically) 0-valued pixels on the borders of an image. 
    - This is done so that border pixels are not underrepresented in the output: without padding, a corner pixel falls inside only a single receptive field, whereas pixels near the center are covered by many. 

- In the figure below, we pad a 3 × 3 input, increasing its size to 5 × 5. 
    - With the 2 × 2 kernel implied by the four-term sum below, the corresponding output then increases to a 4 × 4 matrix (5 − 2 + 1 = 4). 
    - The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: 0 × 0 + 0 × 1 + 0 × 2 + 0 × 3 = 0.

![padding](./padding.png)
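
The padding example can be reproduced with `np.pad`. The kernel values 0, 1, 2, 3 are read off the sum quoted above. The input values 0 through 8 are an assumption, since the text does not state them.

```python
import numpy as np

x = np.arange(9, dtype=float).reshape(3, 3)  # assumed 3x3 input values
kernel = np.array([[0., 1.],
                   [2., 3.]])                # 2x2 kernel from the sum above

# Add one ring of zeros around the input: 3x3 becomes 5x5.
padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)

kh, kw = kernel.shape
out = np.zeros((padded.shape[0] - kh + 1, padded.shape[1] - kw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)

print(padded.shape)  # (5, 5)
print(out.shape)     # (4, 4)
print(out[0, 0])     # 0.0, the first window covers only padding and the input's 0
```

In Keras, `Conv2D` exposes this choice through its `padding` argument: `"valid"` (the default, no padding) or `"same"` (pad so that, with stride 1, the output has the same height and width as the input).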