# Convolutional networks

A basic convnet is a stack of ```Conv2D``` and ```MaxPooling2D``` layers:

In [None]:
from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

In [None]:
model.summary()

**Observations:**

- The output of every ```Conv2D``` and ```MaxPooling2D``` layer is a 3D tensor of shape (```height```, ```width```, ```channels```).

- ```width``` and ```height``` dimensions tend to shrink as the network goes deeper.

- The number of channels is controlled by the first argument passed to the ```Conv2D``` layers (32 or 64).
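
As a rough check on these observations, the spatial sizes can be traced by hand. The sketch below assumes the settings used in the model above (3 × 3 convolutions with no padding and stride 1, and 2 × 2 pooling with stride 2). The helper names ```conv3x3_valid``` and ```pool2x2``` are illustrative and not part of Keras.

```python
def conv3x3_valid(size):
    """Side length after a 3x3 convolution with no padding and stride 1."""
    return size - 3 + 1


def pool2x2(size):
    """Side length after 2x2 max pooling with stride 2 (remainder dropped)."""
    return size // 2


# (layer kind, number of filters) for the stack defined above;
# pooling layers keep the channel count, so their filter entry is None.
stack = [("conv", 32), ("pool", None), ("conv", 64), ("pool", None), ("conv", 64)]

size, channels = 28, 1  # matches input_shape=(28, 28, 1)
shapes = []
for kind, filters in stack:
    if kind == "conv":
        size, channels = conv3x3_valid(size), filters
    else:
        size = pool2x2(size)
    shapes.append((size, size, channels))

for (kind, _), shape in zip(stack, shapes):
    print(kind, shape)
```

The trace ends at (3, 3, 64), which should agree with the last line of ```model.summary()``` apart from the leading batch dimension.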

**Next step: add a classifier on top of the convnet**

Feed the last output tensor into a densely connected classifier network.
But first, we have to flatten the 3D outputs to 1D:

In [None]:
model.add(layers.Flatten())
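
To make the flattening step concrete, here is a small NumPy sketch. It assumes the last convolutional output has shape (3, 3, 64) per sample, which is what the layer stack above produces for 28 × 28 inputs. The array contents are placeholders.

```python
import numpy as np

# Placeholder feature map for a single sample: height 3, width 3, 64 channels.
# (Keras' Flatten does the same per sample and keeps the batch axis.)
feature_map = np.zeros((3, 3, 64), dtype="float32")

# Flattening keeps every value and only drops the spatial structure.
flat = feature_map.reshape(-1)
print(flat.shape)  # (576,)

# A Dense layer with 64 units on top of this vector needs one weight per
# input per unit, plus one bias per unit.
dense_params = flat.size * 64 + 64
print(dense_params)  # 36928
```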

The classifier performs 10-way classification (one class per MNIST digit), so the final layer has 10 units with a softmax activation:

In [None]:
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

In [None]:
model.summary()

Now, train the convnet on the MNIST digits:

In [None]:
from keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load MNIST: 60,000 training and 10,000 test images, each 28x28 grayscale
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Add a channels axis and scale pixel values from [0, 255] to [0, 1]
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

# One-hot encode the integer labels, as categorical cross-entropy expects
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64, verbose=0)
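
The ```to_categorical``` calls above turn integer class labels into one-hot vectors. The NumPy sketch below illustrates the idea. The ```one_hot``` function is a hypothetical stand-in written for this example, not the Keras implementation.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row i gets a 1.0 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


example = one_hot([5, 0, 4], num_classes=10)
print(example.shape)  # (3, 10)
print(example[0])     # 1.0 at index 5, zeros elsewhere
```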

Evaluate the model on the test data:

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=0)
print("Test accuracy: ", test_acc)

A densely connected network reaches a test accuracy of approximately 97.8%, while the basic convnet reaches about 99.3%.

## The convolution operation
```Dense``` layers learn **global** patterns in their input feature space.

-> Convolution layers learn **local** patterns. In the case of images, these are patterns found in small 2D windows of the inputs. In the previous example, these windows were all 3 × 3.

This gives convnets two properties:
- ***The patterns they learn are translation invariant:*** After learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere.
- ***They can learn spatial hierarchies of patterns:*** They efficiently learn increasingly complex and abstract visual concepts (a first conv layer learns small local patterns such as edges, a second learns larger patterns made of the features of the first layer, and so on).
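
The first property can be illustrated with a hand-rolled sliding-window filter in NumPy. This is a sketch, not how Keras implements ```Conv2D```: the functions ```correlate2d_valid``` and ```place``` and the toy pattern are made up for the example. The same filter gives the same peak response wherever the pattern sits, and only the location of the peak moves.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def place(pattern, top, left, size=8):
    """Return a blank size x size image with `pattern` pasted at (top, left)."""
    image = np.zeros((size, size))
    image[top:top + pattern.shape[0], left:left + pattern.shape[1]] = pattern
    return image


# A toy 3x3 "X" pattern, used both as the image content and as the filter.
pattern = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [1, 0, 1]], dtype=float)

resp_a = correlate2d_valid(place(pattern, 0, 0), pattern)  # pattern top-left
resp_b = correlate2d_valid(place(pattern, 5, 5), pattern)  # pattern bottom-right

peak_a = tuple(int(i) for i in np.unravel_index(resp_a.argmax(), resp_a.shape))
peak_b = tuple(int(i) for i in np.unravel_index(resp_b.argmax(), resp_b.shape))
print(resp_a.max(), peak_a)
print(resp_b.max(), peak_b)
```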

Convolutions operate over 3D tensors, called *feature maps*, with two spatial axes (```height``` and ```width```) as well as a ```depth``` axis (also called the *channels* axis). An RGB image has ```depth = 3``` (red, green, and blue). A black-and-white picture has ```depth = 1``` (levels of gray).

The convolution operation produces an *output feature map*, which is still a 3D tensor with ```height```, ```width```, and ```depth``` axes. Here, however, the ```depth``` axis no longer stands for colors; it stands for *filters*.

Convolutions are defined by two key parameters:
- ***Size of the patches extracted from the inputs:***  typically 3 × 3 or 5 × 5.
- ***Depth of the output feature map:***  the number of filters computed.
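
These two parameters can be seen directly in a naive NumPy version of the operation. The sketch below is illustrative only: ```conv2d_valid``` is a hypothetical helper, it uses random weights, and it leaves out the bias and activation that a real ```Conv2D``` layer applies. The patch size is set by the first two kernel axes and the output depth by the last one.

```python
import numpy as np


def conv2d_valid(feature_map, kernels):
    """Naive convolution with no padding and stride 1.

    feature_map: (height, width, in_depth)
    kernels:     (patch_h, patch_w, in_depth, out_depth)
    """
    height, width, _ = feature_map.shape
    patch_h, patch_w, _, out_depth = kernels.shape
    out = np.zeros((height - patch_h + 1, width - patch_w + 1, out_depth))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + patch_h, j:j + patch_w, :]
            # Contract the patch with every filter at once: one value per filter.
            out[i, j, :] = np.tensordot(patch, kernels, axes=3)
    return out


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))      # one grayscale input, depth 1
kernels = rng.random((3, 3, 1, 32))  # 3x3 patches, 32 filters

output = conv2d_valid(image, kernels)
print(output.shape)  # (26, 26, 32)
```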

##### Why can the output width and height be different from the input width and height?
=> **BORDER EFFECTS AND PADDING**
In ```Conv2D``` layers, padding is configurable via the ```padding``` argument, which can be:
- ```"valid"```, which means no padding (only valid window locations are used), so the output "shrinks" (border effect). This is the default.
- ```"same"```, which means "pad in such a way as to have an output with the same width and height as the input."
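
The effect of the two padding modes on one spatial axis can be worked out with a little arithmetic. The sketch below assumes stride 1, and the helper names are made up for the example. With no padding the axis shrinks by ```kernel_size - 1```; adding that many zeros in total restores the original size.

```python
def valid_output_size(input_size, kernel_size):
    """Output length with no padding and stride 1."""
    return input_size - kernel_size + 1


def same_padding(kernel_size):
    """Zeros to add (before, after) so that a stride-1 output keeps the input size."""
    total = kernel_size - 1
    before = total // 2
    return before, total - before


results = {}
for kernel in (3, 5):
    before, after = same_padding(kernel)
    padded_input = 28 + before + after
    results[kernel] = (
        valid_output_size(28, kernel),            # padding="valid"
        valid_output_size(padded_input, kernel),  # padding="same"
    )
    print(kernel, results[kernel])
```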

=> **STRIDES**
Strided convolutions are rarely used in practice, but the *stride* is the distance between two successive windows, and it defaults to 1. If it is set higher than 1, the width and height of the feature map are downsampled by a factor equal to the stride (in addition to any changes induced by border effects).
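
The downsampling can be sketched with the usual output-size formulas for one spatial axis. The function ```strided_output_size``` is a hypothetical helper, and the formulas are assumed to follow the TensorFlow/Keras conventions for ```"valid"``` and ```"same"``` padding.

```python
import math


def strided_output_size(input_size, kernel_size, stride, padding="valid"):
    """Output length along one axis for a given stride and padding mode."""
    if padding == "valid":
        # Count the window positions that fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Padding is chosen so that only the stride reduces the size.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


print(strided_output_size(28, 3, stride=1))                  # 26
print(strided_output_size(28, 3, stride=2))                  # 13
print(strided_output_size(28, 3, stride=2, padding="same"))  # 14
```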

#### Max-pooling operation
The role of max pooling is to aggressively downsample feature maps, much like strided convolutions. It consists of extracting windows from the input feature maps and outputting the max value of each channel.

Max pooling is usually done with 2 × 2 windows and stride 2, which downsamples the feature maps by a factor of 2.
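
A minimal NumPy sketch of 2 × 2 max pooling with stride 2 follows. The function ```max_pool_2x2``` is written for illustration and is not the Keras implementation. Each channel is pooled independently, and any odd leftover row or column is dropped.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width, channels) array."""
    height, width, channels = feature_map.shape
    out_h, out_w = height // 2, width // 2
    # Drop an odd trailing row/column, then group pixels into 2x2 windows.
    trimmed = feature_map[:out_h * 2, :out_w * 2, :]
    windows = trimmed.reshape(out_h, 2, out_w, 2, channels)
    # Take the max inside each window, separately for every channel.
    return windows.max(axis=(1, 3))


feature_map = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool_2x2(feature_map)
print(pooled[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```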

**Why use max-pooling?**
- reduce the number of feature-map coefficients to process
- induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows (in terms of the fraction of the original input they cover)
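
The second point can be made concrete by computing how large a window of the original input each output value depends on (its receptive field). The sketch below uses the standard recurrence for stacked sliding-window layers. The function name and the layer lists are assumptions that mirror the convnet defined earlier.

```python
def receptive_field(layers):
    """Side length of the input window seen by one output value.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    field, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


conv, pool = (3, 1), (2, 2)  # 3x3 convolution (stride 1), 2x2 pooling (stride 2)

with_pooling = receptive_field([conv, pool, conv, pool, conv])
without_pooling = receptive_field([conv, conv, conv])
print(with_pooling, without_pooling)  # 18 7
```

With pooling, a value in the last feature map summarizes an 18 × 18 region of the 28 × 28 input; without it, three stacked 3 × 3 convolutions only cover 7 × 7.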