# Chapter 5: Deep learning for computer vision

This chapter introduces convolutional neural networks, also known as ***convnets***, a type of deep learning model almost universally used in computer vision applications.

## 5.1 Introduction to convnets

#### Instantiating a small convnet

In [1]:
from keras import layers
from keras import models

Using TensorFlow backend.


In [2]:
model = models.Sequential()

model.add(layers.Conv2D(32, (3,3), activation = 'relu', input_shape = (28,28,1)))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3,3), activation = 'relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3,3), activation = 'relu'))



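* A convnet takes as input tensors of shape `(image_height, image_width, image_channels)`, not including the batch dimension. Here, `input_shape = (28,28,1)` matches the format of MNIST images.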

In [3]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________
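The output shapes and parameter counts in this summary can be reproduced by hand. The sketch below is a plain-Python illustration, not part of Keras; the helper names `conv2d_output` and `maxpool2d_output` are made up for this example, and it assumes unpadded, stride-1 convolutions with 3 × 3 windows and 2 × 2 pooling, as in the model above.

```python
def conv2d_output(height, width, in_channels, filters, kernel=3):
    """Shape and parameter count of an unpadded, stride-1 convolution layer."""
    out_h = height - kernel + 1
    out_w = width - kernel + 1
    # Each filter has kernel * kernel * in_channels weights plus one bias.
    params = (kernel * kernel * in_channels + 1) * filters
    return (out_h, out_w, filters), params


def maxpool2d_output(height, width, channels, pool=2):
    """Shape after non-overlapping pooling; leftover rows and columns are dropped."""
    return (height // pool, width // pool, channels)


# Same layer sequence as the model above: (layer kind, number of filters).
layers_spec = [("conv", 32), ("pool", None), ("conv", 64), ("pool", None), ("conv", 64)]

shape = (28, 28, 1)
total_params = 0
history = []
for kind, filters in layers_spec:
    if kind == "conv":
        shape, params = conv2d_output(*shape, filters=filters)
    else:
        shape, params = maxpool2d_output(*shape), 0
    total_params += params
    history.append((kind, shape, params))
    print(kind, shape, params)

print("total params:", total_params)  # 55744
```

Each convolution trims one pixel from every border (28 → 26, 13 → 11, 5 → 3), and each pooling layer halves the spatial size, rounding down (11 → 5).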


#### Adding a classifier on top of the convnet

In [4]:
model.add(layers.Flatten())
model.add(layers.Dense(64, activation = 'relu'))
model.add(layers.Dense(10, activation = 'softmax'))
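As a rough check on what these three layers add, the sketch below computes the flattened size and the parameter count of each densely connected layer. The helper `dense_params` is a hypothetical name used only for this illustration; the input shape `(3, 3, 64)` is the output of the last convolution layer in the summary above.

```python
def dense_params(inputs, units):
    """A dense layer has one weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units


# Flatten turns the (3, 3, 64) feature map into a single vector.
flat_size = 3 * 3 * 64
dense_1 = dense_params(flat_size, 64)  # hidden layer with 64 units
dense_2 = dense_params(64, 10)         # 10-way softmax output layer

print(flat_size)  # 576
print(dense_1)    # 36928
print(dense_2)    # 650
```

`Flatten` itself has no parameters; it only reshapes the data so that the dense layers can process it.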

In [5]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                36928     
__________

#### Training the convnet on MNIST images

In [6]:
from keras.datasets import mnist
from keras.utils import to_categorical

In [7]:
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [8]:
# Add a channel axis and scale pixel values from [0, 255] to [0, 1].
train_images = train_images.reshape((60000,28,28,1))
train_images = train_images.astype('float32')/255

test_images = test_images.reshape((10000,28,28,1))
test_images = test_images.astype('float32')/255

# One-hot encode the integer digit labels.
train_labels = to_categorical(train_labels)

test_labels = to_categorical(test_labels)
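To make the label encoding concrete, here is a minimal plain-Python sketch of one-hot encoding, which is what `to_categorical` produces for integer class labels. The function `one_hot` is a hypothetical stand-in written for this illustration; it returns nested lists, whereas the Keras utility returns a NumPy array.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into vectors with a single 1.0 at the label index."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded


sample_labels = [5, 0, 4]  # made-up digit labels for the example
encoded = one_hot(sample_labels, num_classes=10)
print(encoded[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

This format matches the 10-way softmax output of the model, which is why the `categorical_crossentropy` loss is used with it.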

In [10]:
model.compile(optimizer = 'rmsprop',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

model.fit(train_images, train_labels,
          epochs = 5,
          batch_size = 64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1fd44494358>

In [11]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('test data loss :', test_loss)
print('test data accuracy :', test_acc)

test data loss : 0.030063762222915194
test data accuracy : 0.9927


> Whereas the densely connected network from chapter 2 had a test accuracy of 97.84%, this basic convnet reaches a test accuracy of 99.27%.

Why does this simple convnet work so well compared to a densely connected model? To answer this, let's dive into what the **Conv2D** and **MaxPooling2D** layers do.

### 5.1.1 The convolution operation

The fundamental difference between a densely connected layer and a convolution layer is this:
  * **Dense** layers learn ***global patterns*** in their input feature space.
  * **Convolution** layers learn ***local patterns***: in the case of images, patterns found in small 2D windows of the inputs.
  
This characteristic gives convnets two interesting properties:
  + *The patterns they learn are translation invariant.*
    - After learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere.
    - This makes convnets data efficient when processing images, as they need fewer training samples to learn representations that have generalization power.
  + *They can learn spatial hierarchies of patterns.*
    - Convnets can efficiently learn increasingly complex and abstract visual concepts.
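The translation-invariance property can be illustrated with a toy example. The sketch below is plain Python, not Keras code, and all names in it (`correlate2d_valid`, `place`, `peak`, the `plus` pattern) are made up for this illustration. It slides one fixed 3 × 3 filter over an image, which is the same sliding-window dot product a convolution layer applies for each of its filters, and shows that the filter responds equally strongly wherever the pattern appears.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def place(pattern, top, left, size=8):
    """Return a size x size image of zeros with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for di, row in enumerate(pattern):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image


def peak(response):
    """Return (max value, row, col) of a response map."""
    return max((value, i, j)
               for i, row in enumerate(response)
               for j, value in enumerate(row))


# A small "plus" shape, used both as the pattern in the image and as the filter.
plus = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]

top_left = peak(correlate2d_valid(place(plus, 0, 0), plus))
bottom_right = peak(correlate2d_valid(place(plus, 5, 5), plus))
print(top_left)      # (5, 0, 0)
print(bottom_right)  # (5, 5, 5)
```

The peak response is 5 in both cases; only its position changes. A densely connected layer would have to learn the pattern separately for each location.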

Convolutions are defined by two key parameters:
  + *Size of the patches extracted from the inputs*: These are typically 3 × 3 or 5 × 5.
  + *Depth of the output feature map*: The number of filters computed by the convolution.
  
Also, note that the output width and height may differ from the input width and height. They may differ for two reasons:
  + *Border effects*, which can be countered by padding the input feature map.
  + The use of *strides*, which are defined below.

#### UNDERSTANDING BORDER EFFECTS AND PADDING

If you want to get an output feature map with the same spatial dimensions as the input, you can use *padding*.

*Padding* consists of adding an appropriate number of rows and columns on each side of the input feature map so that a convolution window can be centered on every input tile.
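The effect of padding on output size can be sketched in a few lines. The helpers `zero_pad` and `output_size` below are hypothetical names written for this illustration and assume stride 1. In Keras, the corresponding `Conv2D` argument is `padding`, which accepts `"valid"` (no padding, the default) and `"same"` (pad so that the output has the same width and height as the input).

```python
def zero_pad(feature_map, pad):
    """Add `pad` rows and columns of zeros on every side of a 2D list."""
    width = len(feature_map[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in feature_map:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def output_size(input_size, kernel, pad=0):
    """Number of stride-1 window positions along one dimension."""
    return input_size + 2 * pad - kernel + 1


feature_map = [[1] * 5 for _ in range(5)]   # a 5 x 5 feature map of ones
padded = zero_pad(feature_map, pad=1)

print(output_size(5, kernel=3))             # 3 (no padding)
print(output_size(5, kernel=3, pad=1))      # 5 (one row/column per side)
print(len(padded), len(padded[0]))          # 7 7
```

For a 3 × 3 window, one row and one column on each side is enough to preserve the input size; for a 5 × 5 window, two are needed.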

#### UNDERSTANDING CONVOLUTION STRIDES

The other factor that can influence output size is the notion of *strides*. The distance between two successive windows is a parameter of the convolution, called its ***stride***, which defaults to 1.

*Strided convolutions*, which are convolutions with a stride higher than 1, downsample the feature map. In practice, to downsample feature maps we tend to use the ***max-pooling*** operation instead of strides.
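A small sketch of how the stride changes the output size along one dimension, assuming no padding. The function names `strided_output_size` and `window_starts` are made up for this illustration.

```python
def strided_output_size(input_size, kernel, stride=1):
    """Number of window positions along one dimension, without padding."""
    return (input_size - kernel) // stride + 1


def window_starts(input_size, kernel, stride=1):
    """Index of the first row (or column) of each window."""
    return list(range(0, input_size - kernel + 1, stride))


print(strided_output_size(5, kernel=3, stride=1))  # 3
print(strided_output_size(5, kernel=3, stride=2))  # 2
print(window_starts(5, kernel=3, stride=2))        # [0, 2]
```

With stride 2, a 3 × 3 window over a 5 × 5 input fits at only two positions per dimension, so the output is 2 × 2: width and height are roughly halved, in addition to any border effects.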

### 5.1.2 The max-pooling operation

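The role of max pooling is to aggressively downsample feature maps.
Max pooling consists of extracting windows from the input feature maps and outputting the maximum value of each channel within each window. It is usually done with 2 × 2 windows and a stride of 2, which halves each spatial dimension.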
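As a minimal sketch, here is max pooling on a single channel in plain Python. The function `max_pool_2d` and the sample values are made up for this illustration; it assumes non-overlapping windows (window size equal to stride) and drops any leftover rows or columns.

```python
def max_pool_2d(feature_map, pool=2):
    """Non-overlapping max pooling on a single channel stored as a 2D list."""
    out_h = len(feature_map) // pool
    out_w = len(feature_map[0]) // pool
    return [
        [
            max(feature_map[i * pool + di][j * pool + dj]
                for di in range(pool) for dj in range(pool))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


channel = [[1, 3, 2, 0],
           [4, 2, 1, 1],
           [0, 0, 5, 6],
           [1, 2, 7, 8]]

pooled = max_pool_2d(channel)
print(pooled)  # [[4, 2], [2, 8]]
```

Each 2 × 2 block is replaced by its largest value, so a 4 × 4 channel becomes 2 × 2. In a real feature map the same operation is applied to every channel independently.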

But why downsample feature maps this way? Why not remove the max-pooling layers and keep fairly large feature maps all the way up?

In [12]:
model_no_max_pool = models.Sequential()

model_no_max_pool.add(layers.Conv2D(32, (3,3), activation = 'relu', input_shape = (28,28,1)))
model_no_max_pool.add(layers.Conv2D(64, (3,3), activation = 'relu'))
model_no_max_pool.add(layers.Conv2D(64, (3,3), activation = 'relu'))

model_no_max_pool.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 24, 24, 64)        18496     
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 22, 22, 64)        36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


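> Two things are wrong with this setup:
>  + It isn't conducive to learning a spatial hierarchy of features: the 3 × 3 windows in the third layer only contain information coming from 7 × 7 windows in the initial input.
>  + The final feature map has 22 × 22 × 64 = 30,976 coefficients per sample. Flattening it and putting a **Dense** layer on top would give that layer far too many parameters for such a small model and would result in intense overfitting.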

In short, the reason to use downsampling is **to reduce the number of feature-map coefficients to process**, as well as to **induce spatial-filter hierarchies** by making successive convolution layers look at increasingly large windows.
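The two problems with the pooling-free model can be checked with a little arithmetic. The sketch below is illustrative only: the helper `receptive_field` is a made-up name, and the 64-unit dense layer is an assumption chosen to match the classifier used earlier in this chapter.

```python
def receptive_field(num_layers, kernel=3):
    """Input window seen by one output unit after stacked stride-1 convolutions."""
    size = 1
    for _ in range(num_layers):
        size += kernel - 1
    return size


# Final feature map of the model without max pooling: (22, 22, 64).
coefficients = 22 * 22 * 64

# Assumption: a Dense layer with 64 units placed on top of the flattened map.
dense_units = 64
params_without_pooling = coefficients * dense_units + dense_units

# For comparison, the model with max pooling ends in a (3, 3, 64) feature map.
params_with_pooling = 3 * 3 * 64 * dense_units + dense_units

print(coefficients)            # 30976
print(params_without_pooling)  # 1982528
print(params_with_pooling)     # 36928
print(receptive_field(3))      # 7
```

Without pooling, the first dense layer alone would need about 2 million parameters instead of about 37 thousand, and each unit of the last convolution layer would still only see a 7 × 7 patch of the 28 × 28 input.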

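Note that max pooling isn't the only way you can achieve downsampling: you can also use strides in the prior convolution layer, or average pooling instead of max pooling. However, max pooling tends to work better than these alternatives. Features tend to encode the spatial presence of some pattern or concept over the different tiles of the feature map, and it's more informative to look at the *maximal presence* of different features than at their *average presence*.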
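A toy comparison, with made-up activation values, of how the two pooling strategies treat a feature that fires strongly at a single location. The helper names `pool_2d` and `mean` are hypothetical and exist only for this sketch.

```python
def pool_2d(feature_map, reduce_fn, pool=2):
    """Apply `reduce_fn` to each non-overlapping pool x pool window of a 2D list."""
    out_h = len(feature_map) // pool
    out_w = len(feature_map[0]) // pool
    return [
        [
            reduce_fn([feature_map[i * pool + di][j * pool + dj]
                       for di in range(pool) for dj in range(pool)])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# The feature is strongly present at exactly one position.
activations = [[0, 0, 0, 0],
               [0, 9, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]

max_pooled = pool_2d(activations, max)
avg_pooled = pool_2d(activations, mean)
print(max_pooled)  # [[9, 0], [0, 0]]
print(avg_pooled)  # [[2.25, 0.0], [0.0, 0.0]]
```

Max pooling keeps the full strength of the detection, while average pooling dilutes it with the surrounding zeros.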