# 14. Deep Computer Vision Using Convolutional Neural Networks

Convolutional neural networks (CNNs) emerged from the study of the brain’s visual cortex, and they have been used in image recognition since the 1980s.

### Convolutional Layers

Neurons in the first convolutional layer are **not** connected to every pixel in the input image but only to pixels in their receptive fields. In turn, each neuron in a later layer is connected only to the neurons located within a small rectangle of the previous layer.

**Note**: in CNNs each layer is in 2D.

A neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$, columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field. In order for a layer to have the same height and width as the previous layer, zeros are added around the inputs (**zero padding**).  

![CNN_Layers](images/13.CNN_Layers.png)

The shift from one receptive field to the next is called the **stride**. With a stride larger than 1, the output layer is smaller than the input layer.
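The effect of padding and stride on the layer size can be checked with a little arithmetic. This is a minimal sketch, assuming the Keras-style conventions for `"same"` padding (zeros added as needed) and `"valid"` padding (no zeros added); the helper name `conv_output_size` is hypothetical.

```python
import math

def conv_output_size(input_size, field_size, stride=1, padding="same"):
    """Output length along one dimension (height or width) of a convolutional layer."""
    if padding == "same":
        # zero padding is added, so only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == "valid":
        # no padding: the receptive field must fit entirely inside the input
        return (input_size - field_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")

print(conv_output_size(28, 3, stride=1, padding="same"))   # 28
print(conv_output_size(28, 3, stride=2, padding="same"))   # 14
print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
```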

#### Filters

Filters, or **convolutional kernels**, can be represented as a small image the size of the receptive field. Their weights put particular emphasis on certain features of the input, e.g. horizontal lines (hence the name _feature map_ for their output).
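To make this concrete, here is a small NumPy sketch with two hand-written 3 x 3 filters applied to a toy image containing one horizontal line. The image, the filter values and the helper `apply_filter` are all made up for illustration; a real layer learns its filters during training.

```python
import numpy as np

# 5x5 image with a single horizontal white line in the middle row
image = np.zeros((5, 5))
image[2, :] = 1.0

# 3x3 filters: one responds to horizontal lines, the other to vertical lines
horizontal_filter = np.array([[0., 0., 0.],
                              [1., 1., 1.],
                              [0., 0., 0.]])
vertical_filter = horizontal_filter.T

def apply_filter(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the feature map."""
    f_h, f_w = kernel.shape
    out_h = image.shape[0] - f_h + 1
    out_w = image.shape[1] - f_w + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # weighted sum over rows i..i+f_h-1 and columns j..j+f_w-1
            feature_map[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return feature_map

horizontal_map = apply_filter(image, horizontal_filter)
vertical_map = apply_filter(image, vertical_filter)
print(horizontal_map.max(), vertical_map.max())  # 3.0 1.0
```

The horizontal filter responds three times more strongly than the vertical one, which is what "emphasizing a feature" means here.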

#### Stacking Multiple Feature Maps

In short, a convolutional layer simultaneously applies multiple trainable filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

#### TensorFlow Implementation

Input image = 3D tensor [_height, width, channels_]  
Mini-batch = 4D tensor [_mini-batch size, height, width, channels_]  
CNN weights = 4D tensor [$f_h$, $f_w$, $f_{n'}$, $f_n$]  
Bias term = 1D tensor [$f_n$]  
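The shapes above can be checked with a naive NumPy convolution. This is only a sketch under simple assumptions (stride 1, no zero padding, random values), not how TensorFlow implements the operation; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

batch_size, height, width, channels = 2, 6, 6, 3   # f_n' = 3 input channels
f_h, f_w, n_filters = 3, 3, 4                      # f_n = 4 feature maps

images = rng.random((batch_size, height, width, channels))   # mini-batch: 4D
weights = rng.random((f_h, f_w, channels, n_filters))        # filters: 4D
biases = np.zeros(n_filters)                                 # biases: 1D

# stride 1 and no zero padding, so the output is smaller than the input
out_h, out_w = height - f_h + 1, width - f_w + 1
outputs = np.zeros((batch_size, out_h, out_w, n_filters))
for i in range(out_h):
    for j in range(out_w):
        patch = images[:, i:i + f_h, j:j + f_w, :]   # [batch, f_h, f_w, f_n']
        # sum over the receptive field and input channels, for every filter at once
        outputs[:, i, j, :] = np.tensordot(patch, weights, axes=3) + biases

print(outputs.shape)  # (2, 4, 4, 4)
```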

### Pooling Layers

The goal of pooling layers is to **subsample** the input image in order to reduce the computational load, the memory usage, and the number of parameters. A pooling neuron has no weights; all it does is **aggregate inputs** using an aggregation function such as max or mean.

#### TensorFlow Implementation

In [1]:
from tensorflow import keras

# using max
max_pool = keras.layers.MaxPool2D(pool_size=2)
# using average
avg_pool = keras.layers.AvgPool2D(pool_size=2)

Interestingly, max pooling is more popular than average pooling, probably because it keeps only the strongest feature, eliminating potential noise. It is also slightly cheaper to compute and offers stronger translation invariance.
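The difference between the two aggregators is easy to see on a toy feature map. The following NumPy sketch assumes non-overlapping 2 x 2 pooling (stride equal to the pool size, as with `pool_size=2` in Keras); the helper `pool2d` and the input values are made up for illustration.

```python
import numpy as np

feature_map = np.array([[1., 2., 0., 0.],
                        [3., 4., 0., 8.],
                        [0., 0., 5., 5.],
                        [0., 0., 5., 5.]])

def pool2d(x, pool_size=2, aggregate=np.max):
    """Non-overlapping pooling (stride equal to pool_size) over a 2D array."""
    h, w = x.shape
    # crop to a multiple of pool_size, split into blocks, then aggregate each block
    blocks = x[:h - h % pool_size, :w - w % pool_size].reshape(
        h // pool_size, pool_size, w // pool_size, pool_size)
    return aggregate(blocks, axis=(1, 3))

print(pool2d(feature_map, aggregate=np.max).tolist())   # [[4.0, 8.0], [0.0, 5.0]]
print(pool2d(feature_map, aggregate=np.mean).tolist())  # [[2.5, 2.0], [0.0, 5.0]]
```

In the top-right block, max pooling keeps the single strong activation (8) while mean pooling dilutes it to 2.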

### CNN Architectures

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets **smaller and deeper** as it progresses through the network: smaller because of the pooling layers, deeper because of the growing number of feature maps.

Example:

In [2]:
model = keras.models.Sequential([
    keras.layers.Conv2D(64, 7, activation="relu", padding="same",
                        input_shape=[28, 28, 1]),
    # max pooling layer (pooling size 2)
    keras.layers.MaxPooling2D(2),
    # double the number of filters after each pooling layer
    keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
    keras.layers.Conv2D(256, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),
    # flatten input into 1D array
    keras.layers.Flatten(),
    # fully connected network
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax")
])
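To see the "smaller and deeper" effect in this model, the shapes can be traced by hand. The sketch below is plain Python and does not call Keras; it assumes the Keras defaults (stride 1 for `Conv2D`, stride 2 and no padding for `MaxPooling2D(2)`), and the `layers` list is just a hand-written summary of the model above.

```python
# Trace (height, width, channels) through the convolutional part of the model.
# Conv2D with padding="same" and stride 1 keeps height and width;
# MaxPooling2D(2) halves them, rounding down for odd sizes.
layers = [
    ("conv", 64), ("pool", None),
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("pool", None),
]

height, width, channels = 28, 28, 1
shapes = []
for kind, n_filters in layers:
    if kind == "conv":
        channels = n_filters                     # one feature map per filter
    else:
        height, width = height // 2, width // 2  # 2x2 pooling halves each dimension
    shapes.append((height, width, channels))

flattened_size = height * width * channels       # input size of the first Dense layer
print(shapes[-1], flattened_size)  # (3, 3, 256) 2304
```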

### Using Pretrained Models from Keras

We can load pretrained networks very easily in Keras: 

In [3]:
model = keras.applications.resnet50.ResNet50(weights="imagenet")



### Pretrained Models for Transfer Learning

If we want to build an image classifier but we do not have enough training data, it is often a good idea to reuse the lower layers of a pretrained model.

In [4]:
import tensorflow_datasets as tfds

dataset, info = tfds.load("tf_flowers", as_supervised=True,
                          with_info=True)
dataset_size = info.splits["train"].num_examples # 3670
class_names = info.features["label"].names # ["dandelion", "daisy", ...]
n_classes = info.features["label"].num_classes # 5



Unfortunately, we will need to do the splitting ourselves:

In [5]:
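# tfds.Split.TRAIN.subsplit() was removed from recent TFDS releases;
# split slicing strings do the same job (10% test, 15% validation, 75% training)
test_split, valid_split, train_split = "train[:10%]", "train[10%:25%]", "train[25%:]"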

In [6]:
test_set = tfds.load("tf_flowers", split=test_split, as_supervised=True)
valid_set = tfds.load("tf_flowers", split=valid_split,
                      as_supervised=True)
train_set = tfds.load("tf_flowers", split=train_split,
                      as_supervised=True)

Preprocessing (our CNN expects 224 x 224):

In [7]:
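import tensorflow as tf

def preprocess(image, label):
    # resize to the 224 x 224 input size, then apply Xception's own preprocessing
    resized_image = tf.image.resize(image, [224, 224])
    final_image = keras.applications.xception.preprocess_input(resized_image)
    return final_image, label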

Let’s apply this preprocessing function to all three datasets, shuffle the training set, and add batching and prefetching to all the datasets:

In [9]:
import tensorflow as tf

batch_size = 32
train_set = train_set.shuffle(1000)
train_set = train_set.map(preprocess).batch(batch_size).prefetch(1)
valid_set = valid_set.map(preprocess).batch(batch_size).prefetch(1)
test_set = test_set.map(preprocess).batch(batch_size).prefetch(1)

Now we load the Xception model, pretrained on ImageNet, excluding the top of the network (`include_top=False`). We then add our own global average pooling layer, based on the output of the base model, followed by a dense output layer with one unit per class, using the softmax activation function.

In [10]:
base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(n_classes, activation="softmax")(avg)
model = keras.Model(inputs=base_model.input, outputs=output)



It’s usually a good idea to freeze the weights of the pretrained layers, at least at the beginning of training:

In [11]:
for layer in base_model.layers:
    layer.trainable = False

Finally, we can compile the model and start training:

In [12]:
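# WARNING: this could take a while without a GPU

# learning_rate replaces the deprecated lr argument; in recent TF releases the
# decay argument may only be accepted by the legacy optimizers
optimizer = keras.optimizers.SGD(learning_rate=0.2, momentum=0.9, decay=0.01)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])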
history = model.fit(train_set, epochs=5, validation_data=valid_set)



After the top layers have been trained, we are ready to unfreeze all the layers (or just the top ones) and continue training. Let's not forget to **compile** the model again, since freezing or unfreezing layers only takes effect once the model is compiled:

In [13]:
# WARNING: this also may take a while 

for layer in base_model.layers:
    layer.trainable = True
    
# lower learning rate and decay, to avoid damaging the pretrained weights
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, decay=0.001)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5, validation_data=valid_set)



But there is more to computer vision than just classification. After knowing _what_ things are, how about knowing _where_ they are?

### Classification and Localization

A bounding box around an object can be predicted using four numbers: the horizontal and vertical coordinates of the object's center, plus its height and width.

This means we can accomplish this by adding a second dense output layer with four units (typically on top of the global average pooling layer):

In [14]:
base_model = keras.applications.xception.Xception(weights="imagenet",
                                                include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
class_output = keras.layers.Dense(n_classes, activation="softmax")(avg)
loc_output = keras.layers.Dense(4)(avg)
model = keras.Model(inputs=base_model.input,
                    outputs=[class_output, loc_output])
model.compile(loss=["sparse_categorical_crossentropy", "mse"],
              loss_weights=[0.8, 0.2],  # depends on what you care most about
              optimizer=optimizer, metrics=["accuracy"])

As an evaluation metric, it is commonplace to use **Intersection over Union** (IoU), which as the name implies is (area of the intersection between the predicted box and the actual box) / (area of their union).
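A minimal sketch of this metric in plain Python, assuming boxes are given as (center x, center y, height, width) as described earlier in this section. The function names and the example boxes are made up for illustration.

```python
def to_corners(box):
    """Convert (center_x, center_y, height, width) to (x_min, y_min, x_max, y_max)."""
    cx, cy, h, w = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (center_x, center_y, height, width)."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    # the overlap along an axis is zero when the boxes do not intersect
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0

predicted = (5.0, 5.0, 4.0, 4.0)   # 4x4 box centered at (5, 5)
actual = (6.0, 5.0, 4.0, 4.0)      # same box shifted right by 1
print(iou(predicted, actual))      # 0.6
```

An IoU of 1 means a perfect match and 0 means no overlap at all.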

### Object Detection

Previously, object detection was done primarily by sliding rectangles of varying sizes across the image. This detects the same object several times at slightly different positions, so the unnecessary boxes have to be removed afterwards (non-max suppression). Usually this is done by:

1. Keeping only the boxes whose _objectness score_ is higher than a certain threshold
2. Finding the remaining bounding box with the highest objectness score, and getting rid of all the other bounding boxes that overlap a lot with it
3. Repeating step 2 until there are no more boxes to get rid of
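The steps above can be sketched in plain Python. This is an illustrative version, not a library implementation: boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, "overlap a lot" is interpreted as an IoU above a threshold, and both threshold values are arbitrary choices.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes that survive, best score first."""
    # step 1: keep only boxes whose objectness score clears the threshold
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        # step 2: take the best remaining box and drop the ones overlapping it
        best = candidates.pop(0)
        kept.append(best)
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
        # step 3: loop until no candidates remain
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Box 1 is dropped because it overlaps box 0 too much, box 3 because its score is too low, and box 2 survives because it covers a different region.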

Since 2015, there has been a new guy in town.

### mean Average Precision (mAP)

A very common object detection metric. Suppose we have a classifier with 90% precision at 10% recall, and 96% precision at 20% recall. There is no trade-off here: operating at 20% recall is simply better than operating at 10%. So what we should be looking at is the **maximum** precision the classifier can offer with **at least** a given level of recall.

To get the Average Precision (AP), we compute this maximum precision at several levels of recall (e.g. at least 0%, 10%, ..., 100%) and average the results. With more than two classes, the AP is computed for each class, and the mean of these values is the mAP.
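As a worked example, the averaging can be sketched in a few lines of Python. The sketch assumes 11 evenly spaced recall levels (0%, 10%, ..., 100%), which is one common convention among several, and the precision/recall pairs are made-up numbers; `average_precision` is a hypothetical helper, not a library function.

```python
def average_precision(precisions, recalls, recall_levels=None):
    """Average, over recall levels, of the best precision reachable at that recall or more."""
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    best_precisions = []
    for level in recall_levels:
        # maximum precision among operating points with at least this much recall
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        best_precisions.append(max(reachable, default=0.0))
    return sum(best_precisions) / len(best_precisions)

# hypothetical operating points of a single classifier (made-up numbers)
recalls = [0.1, 0.2, 0.5, 1.0]
precisions = [0.90, 0.96, 0.80, 0.50]
ap = average_precision(precisions, recalls)
print(round(ap, 4))  # 0.7073
```

Note that the 90% precision point never contributes: at every recall level it can reach, the 96% point is also reachable and better.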

### Semantic Segmentation

In semantic segmentation, every pixel is classified according to the class of the object it belongs to. Pixels of the same class all end up together: different objects of the same class are not distinguished from one another.

As for object detection, there are many approaches. A fairly simple one was suggested by Jonathan Long et al. in 2015. In short:

* Take a pretrained CNN and turn it into a fully convolutional network (FCN)
* The CNN applies an overall stride of 32 to the input image (so the feature maps output by the last layer are 32 times smaller than the original image)
* Add an upsampling layer to get back to full resolution. Several approaches are possible here:
    * Transposed convolutional layer (first stretching the image by inserting empty rows and columns, full of zeros, then performing a regular convolution)
    * Regular convolutional layer that uses fractional strides (another way of looking at the same operation)
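The "stretch, then convolve" idea behind the transposed convolutional layer can be illustrated in one dimension with NumPy. This is a simplified sketch, not the Keras `Conv2DTranspose` layer: the kernel is hand-picked rather than learned, and the output length follows from this particular zero-insertion scheme. The helper name `transposed_conv1d` is made up.

```python
import numpy as np

def transposed_conv1d(signal, kernel, stride=2):
    """Upsample a 1D signal: insert zeros between samples, then run a regular convolution."""
    stretched = np.zeros((len(signal) - 1) * stride + 1)
    stretched[::stride] = signal            # original samples, zeros in between
    # regular convolution with zero padding, so the output keeps the stretched length
    return np.convolve(stretched, kernel, mode="same")

signal = np.array([1.0, 2.0, 3.0])
kernel = np.array([0.5, 1.0, 0.5])          # hand-picked; a real layer would learn it
print(transposed_conv1d(signal, kernel).tolist())  # [1.0, 1.5, 2.0, 2.5, 3.0]
```

With this particular kernel the layer performs linear interpolation; a trained layer is free to learn a different upsampling.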
    