### Model validation

So far, we have judged how well a model generalizes by splitting the data into training and validation sets. However, because we select the model based on its validation performance, the result is biased towards the validation set. When deciding which model to choose, it is therefore preferable to evaluate on a third dataset, the test set.

![Model validation](part3_images/model_validation.png)

We create a validation set to:

- Measure how well a model generalizes during training
- Tell us when to stop training a model: when the validation loss stops decreasing (and especially when the validation loss starts increasing while the training loss is still decreasing)
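
The stopping rule above can be sketched as a small helper. This is only an illustration, not the code from the course notebook: the function name `best_epoch`, the `patience` parameter, and the loss values are all assumptions made for the example.

```python
def best_epoch(val_losses, patience=2):
    """Return the index of the epoch with the lowest validation loss seen
    before `patience` consecutive epochs fail to improve on it."""
    best_idx = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped decreasing: stop training
    return best_idx

# Hypothetical per-epoch validation losses: they fall, then start to rise.
val_losses = [0.90, 0.62, 0.48, 0.45, 0.47, 0.52, 0.60]
stop_at = best_epoch(val_losses)
print(stop_at)  # 3
```

In a real training loop you would typically save the model weights whenever the validation loss improves, so that the best epoch can be restored afterwards.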

![Model epochs](part3_images/model_training_validation.png)

You can check an [example](convolutional-neural-networks/mnist-mlp/mnist_mlp_solution_with_validation.ipynb) of a model that uses both validation and test datasets.

### Image Classification Steps

Below is an overview of the image classification process using deep learning.

![Image classification steps](part3_images/image_class_steps.png)


### MLP vs CNN

While on the MNIST dataset an MLP can do just as good a job as a CNN, the differences are very large in practice. There are a few reasons why the two perform similarly on MNIST. The images are:
- clean
- well-centered
- pre-processed
- roughly the same size

The issue with an MLP is that it requires a flat vector as input, so it loses the knowledge of the spatial arrangement of the pixels. That's exactly where a CNN can shine: it is designed to work with multi-dimensional data and to recognize patterns in the image thanks to its architecture. See the [MNIST database](http://yann.lecun.com/exdb/mnist/).

## Convolutional Neural Networks (CNN)


### Local Connectivity

![MLP vs CNN](part3_images/mlp_vs_cnn.png)

Based on these pros and cons, it can be argued that we need a completely different approach to handling the image input if we want to take advantage of the spatial patterns.

Case in point: does every hidden node need to be connected to every input pixel? No, and here is why. We can focus each node on a specific region of the image, which makes our layer locally connected: each node only has the parameters it needs, unlike in a densely connected layer. In the example, each hidden node sees about a quarter of the original image.

Every hidden node still reports to the output layer where the output layer combines the discovered patterns learned separately in each region.

This makes it less prone to overfitting and we can also use a 2D matrix as input.

More layers -> more complex patterns.

What is even more useful is to have the hidden nodes within a collection share a common group of weights, so that different regions of the image can detect the same kind of pattern.

![Local connectivity](part3_images/local_connectivity.png)
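
To make the savings concrete, here is a small parameter count for a toy case. The 4x4 image and the 4 hidden nodes (one per image quadrant) are assumed numbers chosen for illustration, and biases are ignored.

```python
image_height, image_width = 4, 4   # toy grayscale image (assumed size)
hidden_nodes = 4                   # one hidden node per image quadrant

num_pixels = image_height * image_width
pixels_per_region = num_pixels // hidden_nodes  # each node sees a quarter

# Dense layer: every hidden node is connected to every pixel.
dense_weights = num_pixels * hidden_nodes

# Locally connected layer: every hidden node only sees its own region.
local_weights = pixels_per_region * hidden_nodes

# Shared weights: all hidden nodes reuse the same group of weights.
shared_weights = pixels_per_region

print(dense_weights, local_weights, shared_weights)  # 64 16 4
```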

### Filters and the convolutional layer

![Filters and the convolutional layers](images/filters_conv_layers.png)

A convolutional layer is a special type of neural network layer that preserves **spatial patterns**. Several filters with different purposes are applied to the input to extract features such as edges and colors.

#### Filters and edges

Filters applied to an image let us detect **spatial patterns**, which you can think of as **color or shape**. In the same way, filters can detect changes in **color intensity**, which is what lets us find the boundary of a shape.

#### Frequency in images

In images, frequency is the rate of change of intensity:
- high frequency means that the intensity changes a lot: the level of brightness changes quickly from one pixel to the next
- low frequency means that the brightness is relatively uniform or changes very slowly

To better picture this, think about sound, where frequency refers to how fast a sound wave is oscillating.

![Frequency](images/frequency_images.png)

Essentially, **high frequency components also correspond to the edges of objects** in images which can help us classify those objects.
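
As a rough sketch of this idea, the difference between neighbouring pixels in a single row can serve as a simple measure of local frequency. The pixel values below are made up for the example.

```python
import numpy as np

# One row of hypothetical grayscale intensities: a dark region, then a bright one.
row = np.array([10, 10, 12, 11, 200, 201, 199, 200])

# Absolute change between each pixel and its right-hand neighbour.
change = np.abs(np.diff(row))

# The largest change marks the edge between the two regions.
edge_index = int(np.argmax(change))
print(change)      # small values inside each region, one large jump
print(edge_index)  # 3, i.e. the edge lies between pixels 3 and 4
```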

#### High pass filters

**Filters** are used to:
1. filter out unwanted information
2. amplify features (e.g. object boundaries)

**High-pass filters** are used to:
- sharpen an image
- enhance high-frequency parts of an image (e.g. detect a line)

**Edge detection**: edges are areas within an image where the intensity changes very quickly, and they often indicate object boundaries. How does edge detection work? It is done with the help of convolutional kernels.

![Edge detection filter](part3_images/edge_det_filter.png)

A **kernel** is just a matrix that modifies an image. For edge detection, it is important that all of its elements sum to 0, because the filter calculates the difference, or change, between neighbouring pixels.

If the sum is not 0, it is either positive or negative, which in turn brightens or darkens the entire filtered image.

Convolution kernel representation: $K * F(x, y) = \text{output image}$, where $*$ denotes convolution, not multiplication.

![Convolutional kernels](part3_images/convolutional_kernels.png)

How do convolutional kernels work? The kernel is placed over the image, centered on a pixel. Each weight of the filter is multiplied by the pixel value underneath it, and all of these products are summed.

This sum becomes the value for the corresponding pixel at the same location in the output image. This gets repeated for every pixel position in the image.
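
The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under some assumptions: the function name `filter_image` and the toy image are made up, border pixels without a full neighbourhood are simply skipped, and the kernel is not flipped (which is also how deep learning libraries usually implement "convolution").

```python
import numpy as np

def filter_image(image, kernel):
    """Slide `kernel` over `image`, skipping positions where the kernel
    would extend beyond the image border."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            window = image[y:y + kh, x:x + kw]
            # Multiply weights by the pixels underneath, then sum.
            output[y, x] = np.sum(window * kernel)
    return output

# A common high-pass kernel; its elements sum to 0.
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]])

# Toy image: dark left half (10), bright right half (50).
image = np.array([[10, 10, 10, 50, 50, 50]] * 5)

edges = filter_image(image, kernel)
print(edges[0])  # [  0. -40.  40.   0.]
```

The output is zero wherever the image is uniform and non-zero only next to the vertical boundary between the dark and bright halves.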

**For edge detection**:
- the center pixel is the most important
- followed by the top, bottom, left and right pixels (these increase the contrast of the image)
- corners receive no weight because they are the farthest from the center

Kernel convolution relies on centering a pixel and looking at its surrounding neighbors. So, what do you do if there are no surrounding pixels, like on an image corner or edge?
- **Extend**: The nearest border pixels are conceptually extended as far as necessary to provide values for the convolution. Corner pixels are extended in 90° wedges. Other edge pixels are extended in lines.
- **Padding**: The image is padded with a border of 0s (black pixels).
- **Crop**: Any pixel in the output image which would require values from beyond the edge is skipped. This method can result in the output image being slightly smaller, with the edges having been cropped.
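
The three options can be illustrated with NumPy's `np.pad`. This is a sketch on a made-up 3x3 image, assuming a 3x3 kernel, so a one-pixel border is enough.

```python
import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Extend: repeat the nearest border pixel outwards.
extended = np.pad(image, pad_width=1, mode="edge")

# Padding: surround the image with a border of zeros (black pixels).
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# Crop: no border is added, so a 3x3 kernel only fits where it lies fully
# inside the image and the output loses one pixel on every side.
cropped_shape = (image.shape[0] - 2, image.shape[1] - 2)

print(extended[0])    # [1 1 2 3 3]
print(padded[0])      # [0 0 0 0 0]
print(cropped_shape)  # (1, 1)
```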

![Examples of conv kernels](part3_images/examples_of_conv_kernels.png)

In the example above, option **d** would be a suitable choice for finding and enhancing horizontal edges, while **c** could be used for vertical edges.


### Convolutional Layer

The convolutional layer is what preserves spatial information and learns to extract features such as the edges of objects.

The convolutional layer is produced by applying a series of many different image filters, also known as convolutional kernels, to an input image. 

![Layers in a CNN](part3_images/layers_cnn.png)

**4 kernels = 4 filtered images = depth of 4!** => forms a convolutional layer.

Filters are used to capture characteristics of the image and it's common to have tens to hundreds of collections of nodes each with their own filters. 

![Input to conv](part3_images/input_to_conv_layer.png)

What we can see above are the filters specified for vertical and horizontal edges, which are convolved over the input layer. The result is a set of collections of nodes (4 in this case, one per filter) called **feature maps** or **activation maps**.
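
As a sketch of how four filters produce a layer of depth 4, the snippet below filters one grayscale image with four hand-written edge kernels and stacks the results. The kernel values, the helper name `filter_image`, and the random 8x8 image are assumptions for illustration; they are not the filters from the figure.

```python
import numpy as np

def filter_image(image, kernel):
    """Apply one kernel; border pixels without a full neighbourhood are skipped."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            output[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return output

# Two simple edge kernels and their negatives (4 filters in total).
vertical = np.array([[-1, 0, 1],
                     [-1, 0, 1],
                     [-1, 0, 1]])
horizontal = vertical.T
filters = [vertical, -vertical, horizontal, -horizontal]

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # hypothetical grayscale image

# One feature map per filter, stacked along a new depth axis.
feature_maps = np.stack([filter_image(image, k) for k in filters])
print(feature_maps.shape)  # (4, 6, 6)
```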

To visualize the filtered outputs of the convolutional layers check this [notebook](convolutional-neural-networks/conv-visualization/conv_visualization.ipynb).

Below is an example of the computation of a single node in a convolutional layer for an input layer (RGB image).  

![Conv layer in RGB](part3_images/conv_layer_rgb.png)
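
In code, the value of one such node is a weighted sum over a small patch across all three color channels, plus a bias. The sketch below assumes a 3x3 filter, a channels-first layout, and random values standing in for real pixels and learned weights; any activation function is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3x3 patch of an RGB image, laid out as (channels, height, width).
patch = rng.random((3, 3, 3))

# One filter has a separate 3x3 kernel for each color channel, plus one bias.
weights = rng.normal(size=(3, 3, 3))
bias = 0.1

# Multiply every weight by the pixel underneath it, sum over all three
# channels, and add the bias to get the value of a single node.
node_value = np.sum(patch * weights) + bias
print(node_value)
```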

Note that a CNN learns the values of the filter weights during training.

The 3D array obtained can be used as input to another convolutional layer to discover patterns within patterns and so on.

Essentially, convolutional layers aren't very different from dense layers. Both have weights and biases that are initialized randomly and then learned by minimizing a loss function, and inference works the same way. What makes a CNN special is its ability to determine what kind of patterns it needs to detect.

### Stride and padding

The behavior of the convolutional layer can be controlled by specifying the number of filters and the size of each filter. For example, to increase the number of nodes, increase the number of filters. To increase the size of the detected patterns, increase the size of the filter.

**Stride** refers to the amount by which a filter slides over the image.

**Padding** is the process of adding 0s around the edges of the image so that the filter can move over and capture the border regions. While discarding the edges is tempting, we would likely lose information from those regions.
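
Together, filter size, stride and padding determine the spatial size of the output. The standard relation is `(W - F + 2P) // S + 1` for input size `W`, filter size `F`, padding `P` and stride `S`. A small helper (the function name is made up for this sketch) makes it easy to experiment:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 3x3 filter, no padding: the output shrinks by 2 pixels.
print(conv_output_size(28, 3))                       # 26
# 3x3 filter, padding of 1: the output keeps the input size.
print(conv_output_size(32, 3, padding=1))            # 32
# Same filter with a stride of 2: the output is roughly halved.
print(conv_output_size(32, 3, stride=2, padding=1))  # 16
```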

### Pooling

Pooling helps us reduce the dimensionality of the convolutional layers. Essentially, a complicated dataset requires many filters -> larger stack -> large dimensionality -> more parameters -> overfitting.

Commonly, there are two types of pooling:

#### Max pooling 

Max pooling takes the stack of feature maps as input and slides a window with a defined size and stride over each one, keeping only the maximum value in each window. The output is a stack with the same number of feature maps, but each feature map has been reduced in width and height. It works in a similar way to compression, where only the most important pixels are kept. In the visualization below, the convolutional layer has been reduced to half its size.

![Max pooling](part3_images/max_pooling.png)
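
A minimal NumPy sketch of max pooling on a single feature map follows. The function name `max_pool` and the 4x4 values are made up for the example; a 2x2 window with a stride of 2 is assumed.

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Keep the maximum value of each window in a 2D feature map."""
    out_h = (feature_map.shape[0] - window) // stride + 1
    out_w = (feature_map.shape[1] - window) // stride + 1
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for y in range(out_h):
        for x in range(out_w):
            top, left = y * stride, x * stride
            patch = feature_map[top:top + window, left:left + window]
            pooled[y, x] = patch.max()
    return pooled

feature_map = np.array([[1, 9, 2, 4],
                        [5, 6, 2, 0],
                        [3, 1, 7, 8],
                        [4, 2, 5, 6]])

pooled = max_pool(feature_map)
print(pooled)
# [[9 4]
#  [4 8]]
```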

#### Average pooling

Average pooling, as the name says, averages the values inside the window. For example, a 2x2 window sees 4 pixel values and returns a single value, the average of those four, as output. This type is not very commonly used for image classification; it is preferred for smoothing applications instead.
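
For a 2x2 window with a stride of 2, average pooling can be sketched with a reshape. The 4x4 values are made up, and the trick assumes the height and width are divisible by 2.

```python
import numpy as np

feature_map = np.array([[1, 9, 2, 4],
                        [5, 6, 2, 0],
                        [3, 1, 7, 8],
                        [4, 2, 5, 6]])

h, w = feature_map.shape
# Split the map into non-overlapping 2x2 blocks, then average each block.
blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
averaged = blocks.mean(axis=(1, 3))
print(averaged)
# [[5.25 2.  ]
#  [2.5  6.5 ]]
```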

### Capsule Networks

Capsule networks are a special type of network that detects parts of the objects in an image and represents the spatial relationships between those parts. While CNNs also focus on spatial patterns, they discard spatial information in the pooling layers, and a lot of research has gone into addressing this.

Capsule networks are able to recognize the same object, like a face, in a variety of different poses and with the typical number of features (eyes, nose, mouth), even if they have not seen that pose in the training data. Capsule networks are made of parent and child nodes that build up a complete picture of an object.

![Capsule networks](part3_images/capsule_networks1.png)

- Each node in a tree represents a single capsule.
- Each leaf represents a single, focused observation.

But what are capsules? Are they different from a NN? Capsules are collections of nodes, each of which contains information about a specific part: part properties like width, orientation, color, and so on. The important thing to note is that each capsule outputs a vector with some magnitude and orientation. So, capsules somewhat follow the principle of local connectivity, in that different groups of nodes work on specific tasks.

- Magnitude (m) = the probability that a part exists; a value between 0 and 1.
- Orientation (theta) = the state of the part properties.

These output vectors allow us to do some powerful routing math to build up a parse tree that recognizes whole objects as comprised of several, smaller parts! The magnitude is a special part property that should stay very high even when an object is in a different orientation.
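
To see how a vector can carry both pieces of information, here is a small sketch. These notes do not say how the magnitude is kept between 0 and 1; the `squash` function below is the non-linearity from the original capsule network paper and is an assumption here, as is the toy 2D vector.

```python
import numpy as np

def squash(vector):
    """Shrink a vector so its length lies in [0, 1) without changing its direction.
    Assumed non-linearity, taken from the original capsule network paper."""
    sq_norm = np.sum(vector ** 2)
    if sq_norm == 0:
        return np.zeros_like(vector)
    return (sq_norm / (1.0 + sq_norm)) * vector / np.sqrt(sq_norm)

raw_output = np.array([3.0, 4.0])   # hypothetical raw capsule output
capsule_output = squash(raw_output)

# Magnitude: probability that the part exists.
magnitude = np.linalg.norm(capsule_output)
# Orientation: the direction of the vector, encoding the part properties.
orientation = np.arctan2(capsule_output[1], capsule_output[0])

print(magnitude)    # about 0.96
print(orientation)  # same angle as the raw output vector
```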

For a deeper dive into capsule networks, Udacity has provided a [blog post](https://cezannec.github.io/Capsule_Networks/) and a [github repo](https://github.com/cezannec/capsule_net_pytorch).

### Increasing Depth

Recall that a common preprocessing step in image classification is resizing: all the images are required to have the same size before anything else. Other preprocessing steps include normalization and conversion to tensors.

It's common to resize to a square whose spatial dimension is a power of two, or at least divisible by a power of two.

The input array will always be taller and wider than it is deep. The depth represents the channels, e.g. RGB = 3, grayscale = 1.

![CNN layers](part3_images/cnn_basics.png)

### The goal of a CNN architecture:

- take an input array and make it much deeper than it is tall or wide
- convolutional layers are used to make the array deeper
- max pooling layers are used to decrease the x, y dimensions

![CNN architecture](part3_images/cnn_arch_example.png)
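
The way the array gets deeper while shrinking spatially can be traced with simple shape arithmetic. The layer sizes below (a 32x32 RGB input, three 3x3 convolutions with padding 1 and 16, 32 and 64 filters, each followed by 2x2 max pooling) are assumptions for this sketch and may differ from the figure.

```python
def conv_shape(height, width, depth, num_filters, kernel=3, stride=1, padding=1):
    """Output shape of a convolutional layer: depth becomes the number of filters."""
    out_h = (height - kernel + 2 * padding) // stride + 1
    out_w = (width - kernel + 2 * padding) // stride + 1
    return out_h, out_w, num_filters

def pool_shape(height, width, depth, kernel=2, stride=2):
    """Output shape of a max pooling layer: depth is unchanged."""
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    return out_h, out_w, depth

shape = (32, 32, 3)  # height, width, depth of a hypothetical RGB input
shapes = [shape]
for num_filters in (16, 32, 64):
    shape = conv_shape(*shape, num_filters=num_filters)  # deeper
    shape = pool_shape(*shape)                           # smaller in x, y
    shapes.append(shape)

print(shapes)  # [(32, 32, 3), (16, 16, 16), (8, 8, 32), (4, 4, 64)]
```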

Why do we need padding? Padding is the size of the border added around the image. When we create a convolutional layer, we move a square filter over the image using the center pixel as the anchor, so the **kernel cannot perfectly overlay the edges of the image**. Padding allows us to control the spatial size of the output volumes.

For example, a max pooling layer can be specified as (2, 2) or (4, 4), where the first number is the kernel size and the second is the stride. A (4, 4) layer will **down-sample** the input layer by a factor of 4: since the window jumps 4 pixels at a time, the output volumes are spatially smaller. Likewise, a (2, 2) layer down-samples by a factor of 2.

