# Convolutional Neural Networks

## Quick Review of Neural Networks

We now understand how to perform a calculation in a neuron:

`z = W * x + b`

`a = act_function(z)`
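
As a minimal sketch of the two formulas above, the snippet below computes `z` and `a` for a single neuron with NumPy. The sigmoid is only one possible choice for `act_function`, and the helper name `neuron_forward` and the input values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as one possible act_function."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(W, x, b, act_function=sigmoid):
    """Compute z = W * x + b, then a = act_function(z)."""
    z = np.dot(W, x) + b
    a = act_function(z)
    return z, a

# Hypothetical weights, inputs and bias, chosen only for illustration.
W = np.array([0.5, -0.25])
x = np.array([2.0, 4.0])
b = 0.0
z, a = neuron_forward(W, x, b)
```

With these values the weighted sum cancels out to `z = 0`, and the sigmoid of zero is `0.5`.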

We also know that there are several activation functions available, such as the step function (used by the Perceptron), Sigmoid, Tanh, and ReLU (some others will be discussed shortly).

Connecting single neurons will result in a Neural Network:

- Input Layer
- Hidden Layers
- Output Layer

More layers allow the network to learn more abstract representations of the input.

In order to learn we need some measurement of error and feedback. We use a cost function that can be:

- Quadratic
- Cross Entropy
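
The two cost functions listed above could be sketched as follows. Conventions vary (some texts halve the quadratic cost, and cross entropy has a multi-class form), so this assumes the mean squared error and the binary cross entropy; the function names and the `eps` clipping value are illustrative choices.

```python
import numpy as np

def quadratic_cost(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy_cost(y_true, y_pred, eps=1e-12):
    """Binary cross entropy, averaged over the samples."""
    # Clip predictions so that log(0) is never evaluated.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

# Hypothetical targets and predicted probabilities.
y_true = np.array([1.0, 0.0])
y_pred = np.array([0.5, 0.5])
q = quadratic_cost(y_true, y_pred)
ce = cross_entropy_cost(y_true, y_pred)
```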

Once we have the measurement of error, we need to minimize it by choosing the correct weight and bias values. We use Gradient Descent to find those optimal values.
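
To make the idea concrete, here is a small sketch of Gradient Descent fitting a single weight and bias to toy data under a quadratic cost. The data, the learning rate and the number of steps are assumptions picked so that the example converges; a real network would compute the gradients through backpropagation instead of by hand.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.1, steps=500):
    """Fit y ~ w * x + b by repeatedly stepping against the gradient."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = (w * x + b) - y
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * np.mean(error * x)
        grad_b = 2.0 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (an assumption for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
w, b = gradient_descent(x, y)
```

After the loop, `w` and `b` should be very close to the values 2 and 1 used to generate the data.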

We can then backpropagate the error gradient through multiple layers, from the output layer back to the input layer. So far we have used dense layers (every neuron fully connected to the next layer), but later on we will introduce *softmax* layers.

## New Theory Topics

### Weights Value Initialization

So far we have initialised the weights with random values, but that isn't the only option:

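- Zeros: No randomness, and not a great choice, since every neuron in a layer then computes the same output and receives the same update.
- Random distribution near zero: not optimal, since the outputs of the activation functions can become distorted (for example, collapsing towards zero in deeper layers).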

But there are more:

- Xavier (Glorot) initialization: Uniform/Normal. Draw weights from a uniform or normal distribution with zero mean and a specific variance of 1/n, where n is the number of neurons feeding into the layer.
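
A minimal sketch of the Xavier scheme described above, assuming `n` is the number of inputs to the layer (`n_in`). The function names are hypothetical, and some libraries use the variant with variance 2/(n_in + n_out) instead, so treat the exact scaling as an assumption.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Zero-mean normal weights with variance 1/n_in."""
    std = np.sqrt(1.0 / n_in)
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng):
    """Zero-mean uniform weights with the same variance 1/n_in."""
    # The variance of U(-limit, limit) is limit**2 / 3.
    limit = np.sqrt(3.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W_normal = xavier_normal(256, 128, rng)
W_uniform = xavier_uniform(256, 128, rng)
```

Both matrices have entries whose sample variance should sit near 1/256.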

### Learning Rate and Batch Size

The learning rate defines the step size during Gradient Descent. The batch size lets us feed the network a small sample of the data at a time and apply Stochastic Gradient Descent, as a trade-off between how representative each sample is of the full data and the training time.
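
One way to draw those small samples is to shuffle the data once per pass and slice it into batches, as sketched below. The helper name `iterate_minibatches` and the toy arrays are assumptions; deep learning libraries ship their own batching utilities.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
# Hypothetical data set: 10 samples with 2 features each.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X, y, 4, rng)]
```

With 10 samples and a batch size of 4, one pass yields batches of 4, 4 and 2 samples.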

Adaptive optimisers adjust the learning rate based on the history of the gradients (Adam, for instance, tracks running estimates of the first and second moments of the gradient), meaning the learning rate is variable instead of a fixed value:

- AdaGrad
- RMSProp
- **Adam**

These methods take larger steps at the beginning and eventually move to smaller step sizes as we approach the optimal value.
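
As an illustration, the Adam update rule can be sketched in a few lines of NumPy. The function name `adam_minimize`, the toy objective and the hyperparameter values are assumptions for this example; in practice one would use the optimiser provided by the framework.

```python
import numpy as np

def adam_minimize(grad_fn, theta, learning_rate=0.1, beta1=0.9,
                  beta2=0.999, eps=1e-8, steps=500):
    """Minimise a function with Adam, given a function for its gradient."""
    m = np.zeros_like(theta)  # running mean of the gradients
    v = np.zeros_like(theta)  # running mean of the squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        # Bias correction for the zero-initialised running means.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), np.array([0.0]))
```

The division by `sqrt(v_hat)` is what scales each step by the recent gradient magnitudes, so the effective step size changes during training.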

### Unstable or Vanishing Gradients

As we increase the number of layers in a network, the layers towards the input are affected less by the error calculated at the output as we backpropagate through the network, especially if the network is very deep. Initialisation and normalisation help mitigate this problem, known as **vanishing gradients**.
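
A rough numerical illustration of the effect, assuming sigmoid activations and ignoring the weights entirely: the derivative of the sigmoid is at most 0.25, so multiplying one such factor per layer shrinks the gradient quickly. The helper `gradient_scale` is a simplification made up for this sketch, not a full backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(num_layers, z=0.0):
    """Product of one sigmoid derivative per layer (weights ignored)."""
    return sigmoid_derivative(z) ** num_layers

# z = 0 is where the sigmoid derivative is largest (0.25).
shallow = gradient_scale(2)
deep = gradient_scale(20)
```

Even in this best case, the factor for 20 layers is many orders of magnitude smaller than the factor for 2 layers.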

### Overfitting vs Underfitting a Model

If the error is high on both the train data and the test data, then we are **underfitting**: the model is too simple to capture the pattern in the data.

If the train error is very low but the test error is very high, then we are **overfitting**. Fitting the train data too closely makes the model unreliable for predictions on new data.

We need some sort of balance. With potentially hundreds of parameters in a deep learning neural network, the possibility of overfitting is very high!

There are a few ways to help mitigate this issue:

- L1/L2 Regularization: Adds a penalty for larger weights in the model (not unique to neural networks).
- Dropout: During training, remove neurons randomly, so that the network doesn't rely on any particular neuron.
- Expanding data: Artificially enlarge the data set, for example by adding noise, tilting images, or adding low-level white noise to sound data.
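
The first two techniques in the list can be sketched as below. The dropout shown is the "inverted" variant, which rescales the surviving activations so their expected value is unchanged; that choice, the function names, and the values of `lam` and `keep_prob` are assumptions made for illustration.

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 regularization term that is added to the cost function."""
    return lam * np.sum(weights ** 2)

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero out neurons at random during training."""
    mask = rng.random(activations.shape) < keep_prob
    # Rescale so the expected activation stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
penalty = l2_penalty(np.array([1.0, 2.0]), lam=0.5)
activations = np.ones(1000)
dropped = dropout(activations, keep_prob=0.8, rng=rng)
```

Dropout is applied only during training; at prediction time all neurons are kept.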