# Deeplearning - Anees Ahmad - 2021/07/18

# 8 Introduction to deep learning for computer vision
- convolutional neural networks
  - convnets
  - used universally in computer vision applications
  - image-classification problems
  - small training datasets

---

## 8.1 Introduction to convnets
- a basic convnet
  - a stack of Conv2D and MaxPooling2D layers.
  - a convnet takes as input tensors of shape `(image_height, image_width, image_channels)`, not including the batch dimension

- build the model using the Functional API

In [1]:
# Listing 8.1 Instantiating a small convnet
from tensorflow import keras 
from tensorflow.keras import layers
# define the input shape
# MNIST images are 28 x 28 pixel grayscale images, so there is a single channel
inputs = keras.Input(shape=(28, 28, 1))

# a convnet is a stack of Conv2D and MaxPooling2D layers
# filters is the number of output channels; kernel_size is the side length of the convolution window
# the same layers can be created with the Sequential class as follows
  # model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# pool_size is the factor by which each spatial dimension is scaled down
  # model.add(layers.MaxPooling2D((2, 2)))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)

x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)

x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)

- Output of every Conv2D and MaxPooling2D layer
  - rank-3 tensor of shape `(height, width, channels)`

In [2]:
# next comes a Dense layer, which takes a 1D tensor per sample as input
# so we need to flatten the 3D output of the last Conv2D layer
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs) 

In [3]:
# Listing 8.2 Displaying the model’s summary
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 3, 128)         73856 
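
The output shapes and parameter counts in the summary can be checked by hand. The following is a minimal sketch, assuming the Keras defaults of `padding="valid"` and stride 1 for `Conv2D`, and a pooling stride equal to `pool_size`. The helper names `conv2d_out` and `max_pool_out` are made up for illustration and are not Keras functions.

```python
def conv2d_out(shape, filters, kernel_size):
    """Output shape and parameter count of a 'valid', stride-1 Conv2D layer."""
    height, width, channels = shape
    out_shape = (height - kernel_size + 1, width - kernel_size + 1, filters)
    # one (kernel_size x kernel_size x channels) kernel per filter, plus one bias per filter
    params = kernel_size * kernel_size * channels * filters + filters
    return out_shape, params


def max_pool_out(shape, pool_size):
    """Output shape of max pooling with stride == pool_size (no parameters)."""
    height, width, channels = shape
    return (height // pool_size, width // pool_size, channels)


shape = (28, 28, 1)
shape, p1 = conv2d_out(shape, filters=32, kernel_size=3)    # (26, 26, 32), 320
shape = max_pool_out(shape, pool_size=2)                    # (13, 13, 32)
shape, p2 = conv2d_out(shape, filters=64, kernel_size=3)    # (11, 11, 64), 18496
shape = max_pool_out(shape, pool_size=2)                    # (5, 5, 64)
shape, p3 = conv2d_out(shape, filters=128, kernel_size=3)   # (3, 3, 128), 73856
print(shape, p1, p2, p3)
```

The numbers match the `Output Shape` and `Param #` columns printed by `model.summary()`.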

In [4]:
# Listing 8.3 Training the convnet on MNIST images
from tensorflow.keras.datasets import mnist
 
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255
model.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f91d042f110>

In [5]:
# Listing 8.4 Evaluating the convnet
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

Test accuracy: 0.990


- Without a convnet (a densely connected model), test accuracy was 97.8%
- With the convnet, test accuracy is about 99% (99.0% in the run above)

---

### 8.1.1 The convolution operation

- Dense vs convolution operation
  - Dense layers learn global patterns in their input feature space
  - convolution layers learn local patterns
    - in the case of images, patterns found in small 2D windows of the inputs

      ![](./snaps/8.1.PNG)

- Properties of convnets
  - The patterns they learn are translation-invariant
    - After learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere. 
    - A densely connected model would have to learn the pattern anew if it appeared at a new location.
    - This makes convnets data-efficient when processing images: they need fewer training samples to learn representations that have generalization power.
  - They can learn spatial hierarchies of patterns.
    - A first convolution layer will learn small local patterns such as edges
    - A second convolution layer will learn larger patterns made of the features of the first layer, and so on
    - This allows convnets to efficiently learn increasingly complex and abstract visual concepts, because the visual world is fundamentally spatially hierarchical.
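
Translation invariance follows from applying the same kernel at every position of the input. Here is a minimal pure-Python sketch (no Keras), assuming a hand-written 2 × 2 kernel instead of a learned one. The helpers `cross_correlate` and `place_pattern` are illustrative names only.

```python
def cross_correlate(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def place_pattern(size, pattern, top, left):
    """Return a size x size image of zeros with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for di, row in enumerate(pattern):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image


pattern = [[1, 1],
           [1, 0]]
kernel = pattern  # a filter that responds most strongly to this exact pattern

for top, left in [(0, 0), (3, 2), (4, 4)]:
    response = cross_correlate(place_pattern(6, pattern, top, left), kernel)
    peak = max(max(row) for row in response)
    where = [(i, j) for i, row in enumerate(response)
             for j, value in enumerate(row) if value == peak]
    print((top, left), peak, where)
```

The single kernel produces the same peak response wherever the pattern is placed. A dense layer has a separate weight for every input position, so it has no such guarantee.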

- Convolutions operate over rank-3 tensors called feature maps
  - two spatial axes (height and width)
  - a depth axis (also called the channels axis).
- The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map.
  - This output feature map is still a rank-3 tensor:
    - It has a width and a height.
    - Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis stand for filters. 
    - Filters encode specific aspects of the input data

- For instance, take the following convnet layer
  - `x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(keras.Input(shape=(28, 28, 1)))`
    - input feature map of size (28, 28, 1)
    - output feature map of size (26, 26, 32)
    - 32 filters over the input
      - each output channel contains a 26 × 26 grid of values
        - which is the response map of the filter over the input
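
The patch-extraction view can be written out directly. This is a minimal pure-Python sketch of what a `Conv2D` layer computes before its activation, assuming no padding, stride 1, and no bias. The nested-list layout and the name `conv2d_naive` are illustrative and are not the Keras implementation.

```python
import random


def conv2d_naive(feature_map, kernels):
    """feature_map: [height][width][in_channels]; kernels: [filters][kh][kw][in_channels].

    Returns an output feature map laid out as [out_height][out_width][filters].
    """
    height, width = len(feature_map), len(feature_map[0])
    in_channels = len(feature_map[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    output = []
    for i in range(height - kh + 1):
        row = []
        for j in range(width - kw + 1):
            # Each filter turns the (kh, kw, in_channels) patch at (i, j) into one number.
            row.append([
                sum(feature_map[i + di][j + dj][c] * kernel[di][dj][c]
                    for di in range(kh) for dj in range(kw) for c in range(in_channels))
                for kernel in kernels
            ])
        output.append(row)
    return output


random.seed(0)
# Random stand-in for one (28, 28, 1) input image.
image = [[[random.random()] for _ in range(28)] for _ in range(28)]
# 32 random filters, each of shape (3, 3, 1).
kernels = [[[[random.uniform(-1, 1)] for _ in range(3)] for _ in range(3)]
           for _ in range(32)]

out = conv2d_naive(image, kernels)
print(len(out), len(out[0]), len(out[0][0]))  # 26 26 32
```

Each of the 32 filters yields its own 26 × 26 response map, and stacking them along the last axis gives the (26, 26, 32) output feature map.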

- Convolutions are defined by two key parameters:
  - Size of the patches extracted from the inputs (`kernel_size`, typically 3 × 3 or 5 × 5)
  - Depth of the output feature map (`filters`, the number of filters computed by the convolution)
- Convolution kernel: the learned weight matrix that is applied to every extracted patch
- The output width and height may differ from the input width and height for two reasons:
  - Border effects, which can be countered by padding the input feature map
  - The use of strides, described below


#### UNDERSTANDING BORDER EFFECTS AND PADDING
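- With a 3 × 3 window, a (5, 5) input feature map yields a (3, 3) output feature map, because there are only 9 tiles around which a 3 × 3 window can be centered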

    ![](./snaps/8.2.PNG)

- To get an output feature map with the same spatial dimensions as the input, you can use padding. Padding adds an appropriate number of rows and columns on each side of the input feature map so that a convolution window can be centered on every input tile. For a 3 × 3 window, that means one column on the right, one on the left, one row at the top, and one row at the bottom.

    ![](./snaps/8.3.PNG)

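- In `Conv2D` layers, padding is configured with the `padding` argument
  - "valid" (the default)
    - no padding; only valid window locations are used
  - "same"
    - pad in such a way as to have an output with the same width and height as the input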
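
The two padding modes can be sketched as follows for a stride-1 convolution, assuming an odd window size so that the padding is the same on every side. `output_size` and `zero_pad` are illustrative helpers, not Keras functions.

```python
def output_size(input_size, kernel_size, padding="valid"):
    """Spatial output size of a stride-1 convolution along one axis."""
    if padding == "same":
        return input_size
    return input_size - kernel_size + 1


def zero_pad(grid, kernel_size):
    """Add (kernel_size - 1) // 2 rows and columns of zeros on every side of a 2D grid."""
    pad = (kernel_size - 1) // 2
    width = len(grid[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in grid]
    padded += [[0] * width for _ in range(pad)]
    return padded


grid = [[1] * 5 for _ in range(5)]
print(output_size(5, 3, "valid"))            # 3
padded = zero_pad(grid, 3)                   # 7 x 7 after padding
print(output_size(len(padded), 3, "valid"))  # 5
```

Padding the 5 × 5 grid to 7 × 7 and then sliding a 3 × 3 window over it without further padding gives back a 5 × 5 output, which is what "same" padding achieves.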

#### UNDERSTANDING CONVOLUTION STRIDES
- Stride
  - the distance between two successive windows; it is a parameter of the convolution and defaults to 1

  ![](./snaps/8.4.PNG)
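
With a stride larger than 1, fewer windows fit along each axis. A minimal sketch, assuming no padding; `strided_output_size` and `window_starts` are illustrative names.

```python
def strided_output_size(input_size, kernel_size, stride=1):
    """Number of window positions along one axis with no padding."""
    return (input_size - kernel_size) // stride + 1


def window_starts(input_size, kernel_size, stride=1):
    """Index at which each window begins along one axis."""
    return list(range(0, input_size - kernel_size + 1, stride))


print(strided_output_size(5, 3, stride=1), window_starts(5, 3, stride=1))  # 3 [0, 1, 2]
print(strided_output_size(5, 3, stride=2), window_starts(5, 3, stride=2))  # 2 [0, 2]
```

A stride of 2 roughly halves the width and height of the output, on top of any shrinkage caused by border effects.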

---

### 8.1.2 The max-pooling operation

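- The role of max pooling is to aggressively downsample feature maps, much like strided convolutions
- It extracts windows from the input feature maps and outputs the max value of each channel
- Unlike a convolution, which transforms local patches with a learned linear transformation (the convolution kernel), max pooling transforms each channel with a hardcoded max tensor operation
- A big difference from convolution is that max pooling is usually done with 2 × 2 windows and stride 2, which halves the height and width of the feature map, whereas convolution is typically done with 3 × 3 windows and stride 1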
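
The operation on a single channel can be sketched in a few lines of pure Python. This assumes a 2 × 2 window with stride 2, and `max_pool_2d` is an illustrative helper, not the Keras layer.

```python
def max_pool_2d(channel, pool_size=2, stride=2):
    """Max-pool a single 2D channel given as a list of rows."""
    out_h = (len(channel) - pool_size) // stride + 1
    out_w = (len(channel[0]) - pool_size) // stride + 1
    return [
        [
            max(channel[i * stride + di][j * stride + dj]
                for di in range(pool_size) for dj in range(pool_size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


channel = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2d(channel)
print(pooled)  # [[4, 2], [2, 8]]
```

A real `MaxPooling2D` layer applies the same operation to every channel independently, so the channel count is unchanged while the height and width are halved.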

In [6]:
# Listing 8.5 An incorrectly structured convnet missing its max-pooling layers
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model_no_max_pool = keras.Model(inputs=inputs, outputs=outputs)

In [7]:
model_no_max_pool.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_3 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 conv2d_4 (Conv2D)           (None, 24, 24, 64)        18496     
                                                                 
 conv2d_5 (Conv2D)           (None, 22, 22, 128)       73856     
                                                                 
 flatten_1 (Flatten)         (None, 61952)             0         
                                                                 
 dense_1 (Dense)             (None, 10)                619530    
                                                                 
Total params: 712,202
Trainable params: 712,202
Non-trainable params: 0

- Problems with the above model
  - It isn’t conducive to learning a spatial hierarchy of features.
    - The 3 × 3 windows in the third layer will only contain information coming from 7 × 7 windows in the initial input. The high-level patterns learned by the convnet will still be very small with regard to the initial input, which may not be enough to learn to classify digits (try recognizing a digit by only looking at it through windows that are 7 × 7 pixels!). We need the features from the last convolution layer to contain information about the totality of the input.
  - The final feature map has 22 × 22 × 128 = 61,952 total coefficients per sample. This is huge. When you flatten it to stick a Dense layer of size 10 on top, that layer would have over half a million parameters. This is far too large for such a small model and would result in intense overfitting.
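
Both numbers can be reproduced with a little arithmetic. The sketch below uses the standard receptive-field recurrence for a stack of layers, assuming square windows; `receptive_field` is an illustrative helper, and the 18 × 18 figure for the model with pooling is computed here rather than taken from the text above.

```python
def receptive_field(layers):
    """Side length of the input window seen by one output position.

    `layers` is a list of (window_size, stride) pairs, first layer first.
    """
    field, jump = 1, 1
    for window_size, stride in layers:
        field += (window_size - 1) * jump  # each layer widens the view by (window - 1) input steps
        jump *= stride                     # strides make later layers step further across the input
    return field


conv = (3, 1)   # 3 x 3 convolution, stride 1
pool = (2, 2)   # 2 x 2 max pooling, stride 2

print(receptive_field([conv, conv, conv]))              # 7
print(receptive_field([conv, pool, conv, pool, conv]))  # 18

flattened = 22 * 22 * 128
dense_params = flattened * 10 + 10   # weights plus one bias per output unit
print(flattened, dense_params)       # 61952 619530
```

Without pooling, each position in the last convolution layer sees a 7 × 7 region of the 28 × 28 input. With the two pooling layers of the first model, it sees an 18 × 18 region.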

- use of downsampling
  - reduce the number of feature-map coefficients to process
  - induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows

- Why max pooling works better than alternatives such as average pooling
    - features tend to encode the spatial presence of some pattern or concept over the different tiles of the feature map (hence the term feature map), and it’s more informative to look at the maximal presence of different features than at their average presence.
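
A toy comparison makes the point concrete. This is a minimal sketch, assuming a made-up response map in which a feature fires strongly at a single position; `pool_2d` and `mean` are illustrative helpers.

```python
def pool_2d(channel, reduce_fn, pool_size=2):
    """Apply `reduce_fn` to each non-overlapping pool_size x pool_size window."""
    return [
        [
            reduce_fn([channel[i + di][j + dj]
                       for di in range(pool_size) for dj in range(pool_size)])
            for j in range(0, len(channel[0]) - pool_size + 1, pool_size)
        ]
        for i in range(0, len(channel) - pool_size + 1, pool_size)
    ]


def mean(values):
    return sum(values) / len(values)


# A response map where the feature is strongly present at one position only.
channel = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(pool_2d(channel, max))   # [[8.0, 0.0], [0.0, 0.0]]
print(pool_2d(channel, mean))  # [[2.0, 0.0], [0.0, 0.0]]
```

Max pooling keeps the full strength of the detection, while averaging dilutes it with the surrounding zeros.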

---