# Deep Learning for Computer Vision

## Intro to convnets (convolutional neural networks)

- convnets are a common type of deep learning model used in computer vision applications


In [1]:
from keras import layers
from keras import models

Using TensorFlow backend.


In [9]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))


- a convnet takes input tensors of shape (image_height, image_width, image_channels), not including the batch dimension; here input_shape=(28, 28, 1) matches the MNIST images

In [10]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_7 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


- the output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels)
- the width and height dimensions tend to shrink as you go deeper in the network
- the number of channels is controlled by the first argument passed to Conv2D (32 or 64)
- the next step is to feed the last output tensor, of shape (3, 3, 64), into a densely connected classifier network
- these classifiers process 1D vectors, so the current 3D output tensor has to be flattened first
- then add a few Dense layers on top
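The shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes what the code in this notebook uses: unpadded, stride-1 convolutions with 3x3 windows and non-overlapping 2x2 max pooling. The helper names (conv_out, pool_out, conv_params) are made up for this illustration and are not part of Keras.

```python
def conv_out(size, kernel):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, pool):
    """Spatial size after non-overlapping max pooling (remainder dropped)."""
    return size // pool


def conv_params(kernel, in_depth, out_depth):
    """Kernel weights plus one bias per output filter."""
    return (kernel * kernel * in_depth + 1) * out_depth


size, depth = 28, 1  # MNIST input: 28 x 28 pixels, 1 channel
shapes, params = [], []
for layer, arg in [("conv", 32), ("pool", 2), ("conv", 64), ("pool", 2), ("conv", 64)]:
    if layer == "conv":
        params.append(conv_params(3, depth, arg))
        size, depth = conv_out(size, 3), arg
    else:
        params.append(0)  # pooling has no trainable weights
        size = pool_out(size, arg)
    shapes.append((size, size, depth))

print(shapes)
print(params, sum(params))
```

The printed shapes and counts should match the summary: 26, 13, 11, 5 and 3 for the spatial size, and 320 + 18496 + 36928 = 55,744 parameters in total.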


In [11]:
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

In [12]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_7 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_2 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                36928     
__________
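The Flatten output size and the Dense parameter counts follow directly from the layer definitions. A quick check in plain Python, assuming each Dense layer has one weight per input-output pair plus one bias per unit (the variable names are only for this illustration):

```python
# Flatten turns the (3, 3, 64) feature map into a single vector.
flat_size = 3 * 3 * 64

# Dense(64): one weight per (input, unit) pair plus one bias per unit.
dense_64_params = flat_size * 64 + 64

# Dense(10): same rule, with the 64 outputs of the previous layer as input.
dense_10_params = 64 * 10 + 10

print(flat_size, dense_64_params, dense_10_params)
```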

In [13]:
from keras.datasets import mnist
from keras.utils import to_categorical

In [14]:
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


In [17]:
train_images.shape

(60000, 28, 28)

In [24]:
test_images.shape

(10000, 28, 28)

In [20]:
train_images = train_images.reshape((60000, 28, 28, 1))

In [23]:
train_images = train_images.astype('float32')/255

In [25]:
test_images = test_images.reshape((10000, 28, 28, 1))

In [26]:
test_images = test_images.astype('float32')/255

In [28]:
train_labels = to_categorical(train_labels)

In [32]:
test_labels = to_categorical(test_labels)
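To make the preprocessing steps concrete, here is a small NumPy sketch on made-up data. The arrays fake_images and fake_labels stand in for MNIST, and one_hot is a hand-written helper that is assumed to mimic what to_categorical does for integer class labels; it is not the Keras implementation.

```python
import numpy as np

# Stand-in for a few MNIST images: uint8 pixel values in [0, 255].
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)
fake_labels = np.array([5, 0, 4])

# Add the channel axis and rescale pixel values to [0, 1].
x = fake_images.reshape((3, 28, 28, 1)).astype("float32") / 255


def one_hot(labels, num_classes):
    """Encode integer labels as rows with a single 1 at the label index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


y = one_hot(fake_labels, num_classes=10)
print(x.shape, x.dtype, y.shape)
print(y[0])  # the 1 sits at index 5, the first label
```

The one-hot rows are what categorical_crossentropy compares against the 10 softmax outputs of the last Dense layer.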

In [34]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f311e5b8518>

In [35]:
test_loss, test_acc = model.evaluate(test_images, test_labels)



In [36]:
test_acc

0.9912

### The convolution operation

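The fundamental difference between a densely connected layer and a convolution layer is that dense layers learn global patterns in their input feature space (patterns involving all pixels), whereas convolution layers learn local patterns (for images, patterns found in small 2D windows of the input).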

Properties of convolution layers:
- the patterns they learn are translation invariant
    - after learning a certain pattern in the lower right corner of a picture, a convnet can recognize it anywhere, whereas a densely connected network would have to learn the pattern anew if it appeared at a new location
    - this makes convnets data efficient when processing images (they need fewer training samples)
- they can learn spatial hierarchies of patterns
    - a first convolution layer will learn small local patterns such as edges
    - a second convolution layer will learn larger patterns made of the features of the first layer, and so on
    - this allows convnets to efficiently learn increasingly complex and abstract visual concepts
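The translation property can be illustrated with a single hand-made filter. The NumPy sketch below is a toy example, not a trained convnet: the 3x3 pattern, the 8x8 images and the helper correlate2d_valid are all invented for this illustration. The same filter is slid over two images that contain the pattern in different corners, and it responds equally strongly in both, just at a different position.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide the kernel over every valid position and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# A small "X" shape used both as the pattern in the image and as the filter.
pattern = np.array([[1., 0., 1.],
                    [0., 1., 0.],
                    [1., 0., 1.]])

top_left = np.zeros((8, 8))
top_left[0:3, 0:3] = pattern
bottom_right = np.zeros((8, 8))
bottom_right[5:8, 5:8] = pattern

resp_a = correlate2d_valid(top_left, pattern)
resp_b = correlate2d_valid(bottom_right, pattern)

# Location of the strongest response in each output map.
peak_a = tuple(int(v) for v in np.unravel_index(resp_a.argmax(), resp_a.shape))
peak_b = tuple(int(v) for v in np.unravel_index(resp_b.argmax(), resp_b.shape))
print(peak_a, resp_a.max())
print(peak_b, resp_b.max())
```

Because the weights are shared across all positions, nothing has to be relearned when the pattern moves. A dense layer has a separate weight for every pixel position and gets no such reuse.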


- convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis)
- for an RGB (red, green, blue) image, the dimension of the depth axis is 3, because the image has three color channels
- for a black-and-white (grayscale) image such as the MNIST digits, the depth is 1 (levels of gray)
- the convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map
- the output feature map is still a 3D tensor: it has a width and a height, and its depth can be arbitrary because the output depth is a parameter of the layer
- the different channels in that depth axis no longer stand for specific colors; they stand for filters
- filters encode specific aspects of the input data, e.g. the presence of a face in the input

Convolutions are defined by two key parameters:
- the size of the patches extracted from the input: (3, 3) is a common choice
- the depth of the output feature map: the number of filters computed by the convolution; the example above started with a depth of 32 and ended with a depth of 64
- in Keras, these are the first arguments passed to the layer: Conv2D(output_depth, (window_height, window_width))

A convolution works by
- sliding the patch window over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features, of shape (window_height, window_width, input_depth)
- transforming each 3D patch (via a tensor product with the same learned weight matrix, called the convolution kernel) into a 1D vector of shape (output_depth,)
- spatially reassembling all of these vectors into a 3D output map of shape (height, width, output_depth)
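The three steps above can be sketched in a few lines of NumPy. This is a deliberately naive illustration that assumes no padding and a stride of 1; it is not how Keras implements Conv2D. The function name conv2d_naive and the random kernel are placeholders, since a real layer would learn its kernel during training.

```python
import numpy as np


def conv2d_naive(feature_map, kernel):
    """feature_map: (height, width, in_depth); kernel: (kh, kw, in_depth, out_depth)."""
    height, width, in_depth = feature_map.shape
    kh, kw, k_in, out_depth = kernel.shape
    assert k_in == in_depth, "kernel depth must match input depth"

    out_h, out_w = height - kh + 1, width - kw + 1
    output = np.zeros((out_h, out_w, out_depth))
    for i in range(out_h):
        for j in range(out_w):
            # Step 1: extract the 3D patch at this window location.
            patch = feature_map[i:i + kh, j:j + kw, :]
            # Step 2: tensor product with the shared kernel -> vector of length out_depth.
            vector = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
            # Step 3: place the vector at the matching spatial position.
            output[i, j, :] = vector
    return output


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))              # one grayscale 28 x 28 input
kernel = rng.standard_normal((3, 3, 1, 32))  # 32 filters with 3 x 3 windows
out = conv2d_naive(image, kernel)
print(out.shape)  # (26, 26, 32)
```

The output shape matches the first Conv2D layer in the model summary. A real layer would also add a bias per filter and apply the activation (relu here) to the result.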