In [1]:
from keras import layers
from keras import models

# A basic convnet
model = models.Sequential()

model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1))) # input_shape = (height, width, channel)
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.add(layers.Flatten()) # flatten the 3D outputs to 1D
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

model.summary()  # summary() prints the table itself and returns None

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                36928     
__________
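
The output shapes and parameter counts listed in the summary can be checked by hand. The sketch below is not part of the original notebook:
it is plain Python, and the helper names (conv_out, pool_out, conv_params, dense_params) are made up for illustration. It assumes what the
model above uses: unpadded ("valid") 3 × 3 convolutions with stride 1, and 2 × 2 max pooling with stride 2.

```python
def conv_out(size, kernel=3, stride=1):
    """Spatial output size of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial output size of non-overlapping pooling (stride equal to the window)."""
    return size // window


def conv_params(kernel, in_depth, filters):
    """kernel * kernel * in_depth weights per filter, plus one bias per filter."""
    return (kernel * kernel * in_depth + 1) * filters


def dense_params(inputs, units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return (inputs + 1) * units


# Follow the height/width of the feature map through the stack.
sizes = [28]
sizes.append(conv_out(sizes[-1]))  # conv2d_1
sizes.append(pool_out(sizes[-1]))  # max_pooling2d_1
sizes.append(conv_out(sizes[-1]))  # conv2d_2
sizes.append(pool_out(sizes[-1]))  # max_pooling2d_2
sizes.append(conv_out(sizes[-1]))  # conv2d_3
flat = sizes[-1] ** 2 * 64         # flatten_1: height * width * depth

params = {
    "conv2d_1": conv_params(3, 1, 32),
    "conv2d_2": conv_params(3, 32, 64),
    "conv2d_3": conv_params(3, 64, 64),
    "dense_1": dense_params(flat, 64),
}

print(sizes)   # [28, 26, 13, 11, 5, 3]
print(flat)    # 576
print(params)  # {'conv2d_1': 320, 'conv2d_2': 18496, 'conv2d_3': 36928, 'dense_1': 36928}
```

Note that conv2d_3 and dense_1 happen to have the same count: (3 * 3 * 64 + 1) * 64 and (576 + 1) * 64 are both 577 * 64.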

In [4]:
from keras.datasets import mnist
from keras.utils import to_categorical

# load the data
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# add the channel axis and scale pixel values from [0, 255] to [0, 1]
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels  = to_categorical(test_labels)

# train the model
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

# evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(test_loss, test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
0.03589198187187212 0.9923


In [None]:
"""
The Convolution Operation

1. The fundamental difference between a densely connected layer and a convolution layer is this: Dense layers learn global patterns
   in their input feature space (for example, for an MNIST digit, patterns involving all pixels), whereas convolution layers learn local
   patterns: in the case of images, patterns found in small 2D windows of the inputs. In the previous example, these windows were all 3 × 3.

2. Two interesting properties of convnets:
   -- The patterns they learn are translation invariant. After learning a certain pattern in the lower-right corner of a picture,
      a convnet can recognize it anywhere: for example, in the upper-left corner. A densely connected network would have to learn
      the pattern anew if it appeared at a new location. This makes convnets data efficient when processing images (because the visual
      world is fundamentally translation invariant): they need fewer training samples to learn representations that have generalization
      power.
   -- They can learn spatial hierarchies of patterns. A first convolution layer will learn small local patterns such as edges, a second
      convolution layer will learn larger patterns made of the features of the first layer, and so on. This allows convnets to efficiently
      learn increasingly complex and abstract visual concepts (because the visual world is fundamentally spatially hierarchical).

3. Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called
   the channels axis). The convolution operation extracts patches from its input feature map and applies the same transformation to all of
   these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height. Its depth
   can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for
   specific colors as in RGB input; rather, they stand for filters. Filters encode specific aspects of the input data: at a high level,
   a single filter could encode the concept “presence of a face in the input,” for instance.

4. In the MNIST example, the first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32):
   it computes 32 filters over its input. Each of these 32 output channels contains a 26 × 26 grid of values, which is a response map of
   the filter over the input, indicating the response of that filter pattern at different locations in the input.

5. Convolutions are defined by two key parameters:
   -- Size of the patches extracted from the inputs—These are typically 3 × 3 or 5 × 5.
   -- Depth of the output feature map—The number of filters computed by the convolution.

6. In Conv2D layers, padding is configurable via the padding argument, which takes two values: "valid", which means no padding
   (only valid window locations will be used); and "same", which means “pad in such a way as to have an output with the same width
   and height as the input.” The padding argument defaults to "valid".
   
7. The distance between two successive windows is a parameter of the convolution, called its stride, which defaults to 1. It’s possible
   to have strided convolutions: convolutions with a stride higher than 1. Using stride 2 means the width and height of the feature map
   are downsampled by a factor of 2 (in addition to any changes induced by border effects). To downsample feature maps, instead of strides,
   we tend to use the max-pooling operation.

8. Max pooling consists of extracting windows from the input feature maps and outputting the max value of each channel within each window.
   It’s conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation
   (the convolution kernel), they’re transformed via a hardcoded max tensor operation. A big difference from convolution is that
   max pooling is usually done with 2 × 2 windows and stride 2, in order to downsample the feature maps by a factor of 2, whereas
   convolution is typically done with 3 × 3 windows and stride 1 (that is, no striding).

"""
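
The notes above describe convolution as extracting patches and applying the same transformation to each one, and max pooling as the same
sliding-window idea with a hardcoded max. The NumPy sketch below illustrates that description and is not how Keras implements these layers.
The function names conv2d_valid and max_pool are hypothetical, the kernels are random rather than learned, and the bias and ReLU activation
are left out.

```python
import numpy as np


def conv2d_valid(feature_map, kernels, stride=1):
    """Naive unpadded convolution (cross-correlation, as deep-learning layers compute it).

    feature_map: array of shape (height, width, in_depth)
    kernels:     array of shape (k_h, k_w, in_depth, out_depth)
    """
    h, w, _ = feature_map.shape
    k_h, k_w, _, out_depth = kernels.shape
    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    output = np.zeros((out_h, out_w, out_depth))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = feature_map[top:top + k_h, left:left + k_w, :]
            # The same kernels are applied to every patch: one value per filter.
            output[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return output


def max_pool(feature_map, window=2, stride=2):
    """Naive max pooling: the max of each channel over each window."""
    h, w, depth = feature_map.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    output = np.zeros((out_h, out_w, depth))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = feature_map[top:top + window, left:left + window, :]
            output[i, j, :] = patch.max(axis=(0, 1))
    return output


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))               # stand-in for one MNIST digit
kernels = rng.standard_normal((3, 3, 1, 32))  # 32 random 3 x 3 filters

response = conv2d_valid(image, kernels)            # padding="valid", stride 1
pooled = max_pool(response)                        # 2 x 2 windows, stride 2
strided = conv2d_valid(image, kernels, stride=2)   # strided convolution
# Zero-padding by one pixel per side emulates padding="same" for a 3 x 3 kernel.
same = conv2d_valid(np.pad(image, ((1, 1), (1, 1), (0, 0))), kernels)

print(response.shape)  # (26, 26, 32)
print(pooled.shape)    # (13, 13, 32)
print(strided.shape)   # (13, 13, 32)
print(same.shape)      # (28, 28, 32)
```

The explicit loops keep the patch extraction visible; a practical implementation would vectorize them.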

In [5]:
"""
Why max-pooling?

1. Consider the same convnet with its max-pooling layers removed (model_no_max_pool, defined below). This setup isn’t conducive to
   learning a spatial hierarchy of features. The 3 × 3 windows in the third layer will only contain information coming from 7 × 7
   windows in the initial input. The high-level patterns learned by the convnet will still be very small with regard to the initial
   input, which may not be enough to learn to classify digits (try recognizing a digit by only looking at it through windows that are
   7 × 7 pixels!). We need the features from the last convolution layer to contain information about the totality of the input.

2. The final feature map of that model has 22 × 22 × 64 = 30,976 total coefficients per sample. This is huge. If you were to flatten it
   to stick a Dense layer of size 512 on top, that layer would have about 15.9 million parameters (30,976 × 512 weights plus 512 biases).
   This is far too large for such a small model and would result in intense overfitting.

3. In short, the reason to use downsampling is to reduce the number of feature-map coefficients to process, as well as to induce
   spatial-filter hierarchies by making successive convolution layers look at increasingly large windows.

4. Note that max pooling isn’t the only way you can achieve such downsampling. As you already know, you can also use strides in the prior
   convolution layer. And you can use average pooling instead of max pooling, where each local input patch is transformed by taking
   the average value of each channel over the patch, rather than the max. But max pooling tends to work better than these alternative
   solutions. In a nutshell, the reason is that features tend to encode the spatial presence of some pattern or concept over the different
   tiles of the feature map, and it’s more informative to look at the maximal presence of different features than at their average presence.
   So the most reasonable subsampling strategy is to first produce dense maps of features (via unstrided convolutions) and then look
   at the maximal activation of the features over small patches, rather than looking at sparser windows of the inputs
   (via strided convolutions) or averaging input patches, which could cause you to miss or dilute feature-presence information.
"""
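
Point 4 above compares three ways of downsampling. A toy NumPy example makes the difference concrete. The 4 × 4 feature map below is
invented for illustration: it stands for the dense (stride 1) output of one filter that fired strongly at a single location. A stride-2
convolution computes only every second position of that dense output, which is emulated here by slicing.

```python
import numpy as np

# Hypothetical dense response map of one filter: a single strong activation.
feature_map = np.zeros((4, 4))
feature_map[1, 1] = 1.0

# View the map as a 2 x 2 grid of non-overlapping 2 x 2 tiles:
# tiles[a, c] is the tile in tile-row a and tile-column c.
tiles = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = tiles.max(axis=(2, 3))    # keeps the activation at full strength
avg_pooled = tiles.mean(axis=(2, 3))   # dilutes it across the tile
subsampled = feature_map[::2, ::2]     # what a stride-2 convolution would have computed

print(max_pooled)   # [[1. 0.] [0. 0.]]
print(avg_pooled)   # [[0.25 0.] [0. 0.]]
print(subsampled)   # all zeros: the activation is missed entirely
```

This is a single hand-picked case, not evidence about accuracy; it only shows the "miss or dilute" effect the notes describe.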

model_no_max_pool = models.Sequential()

model_no_max_pool.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))

print(model_no_max_pool.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_4 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 24, 24, 64)        18496     
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 22, 22, 64)        36928     
=================================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________
None
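
The two numbers quoted in the notes for this pooling-free model, the 7 × 7 input window and the size of a hypothetical Dense(512) layer on
top, can be reproduced with a little arithmetic. The sketch below is plain Python and not part of the original notebook. The helper
receptive_field is a made-up name, and it uses the standard recurrence for stacked sliding-window layers: each layer widens the window by
(kernel - 1) times the product of the strides before it.

```python
def receptive_field(stack):
    """Side length of the input window seen by one output unit.

    stack: sequence of (kernel_size, stride) pairs, first layer first.
    """
    field, jump = 1, 1
    for kernel, stride in stack:
        field += (kernel - 1) * jump  # widen by the kernel, measured in input pixels
        jump *= stride                # later layers step over more input pixels
    return field


conv = (3, 1)  # 3 x 3 convolution, stride 1
pool = (2, 2)  # 2 x 2 max pooling, stride 2

without_pooling = receptive_field([conv, conv, conv])
with_pooling = receptive_field([conv, pool, conv, pool, conv])

# Flattened size of the final (22, 22, 64) feature map, and the size of a
# Dense(512) layer placed on top of it (weights plus biases).
coefficients = 22 * 22 * 64
dense_512_params = (coefficients + 1) * 512

print(without_pooling)    # 7
print(with_pooling)       # 18
print(coefficients)       # 30976
print(dense_512_params)   # 15860224
```

Under these assumptions, the two pooling layers of the first model widen the window seen by the last convolution from 7 × 7 to 18 × 18
pixels of the 28 × 28 input.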
