# Fine Tuning Neural Networks

## Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. This helps the neural network perform well on new data beyond the training set. Many regularization approaches work by limiting the capacity of the model, for example by adding a parameter norm penalty $\Omega (\bold{\theta})$ to the objective function $J$. We denote the regularized objective function by $\tilde{J}$:

$$
\tilde{J} (\bold{\theta} ; \bold{X}, \bold{y}) = J(\bold{\theta} ; \bold{X}, \bold{y}) + \alpha \Omega(\bold{\theta})
$$

Here, $\alpha \in [0, \infty)$ is a hyperparameter that weights the contribution of the norm penalty term $\Omega$ relative to the standard objective function $J$. Setting $\alpha$ to 0 results in no regularization, whereas larger values result in more regularization.
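To make the formula concrete, here is a minimal sketch in plain Python. It assumes a linear model with mean squared error as $J$ and the squared L2 norm as $\Omega$; both choices, and all function names, are illustrative assumptions rather than anything fixed by the definition above.

```python
def l2_penalty(theta):
    """Omega(theta) = 0.5 * ||theta||^2, one common (assumed) norm penalty."""
    return 0.5 * sum(w * w for w in theta)


def mse(theta, X, y):
    """Unregularized objective J: mean squared error of a linear model."""
    total = 0.0
    for row, target in zip(X, y):
        pred = sum(w * x for w, x in zip(theta, row))
        total += (pred - target) ** 2
    return total / len(y)


def regularized_objective(theta, X, y, alpha):
    """J_tilde(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)."""
    return mse(theta, X, y) + alpha * l2_penalty(theta)


# Toy data that theta fits exactly, so J = 0 and only the penalty remains.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 11.0]
theta = [1.0, 2.0]

plain = regularized_objective(theta, X, y, alpha=0.0)      # no regularization
penalized = regularized_objective(theta, X, y, alpha=0.1)  # adds 0.1 * 2.5
```

With $\alpha = 0$ the two objectives coincide; any positive $\alpha$ adds a cost that grows with the size of the weights.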

### Dataset Augmentation

The best way to make a machine learning model generalize better is to train it on more data. With dataset augmentation, we modify existing examples or generate new ones so that the deep learning model has more data to train on. We can also inject noise into the training data to help the model generalize better.

For some models, adding noise is equivalent to imposing a penalty on the norm of the weights. Noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. Noise can also be added to the weights, which is something we'll see in the context of RNNs.
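As a rough illustration of noise injection on the inputs, the sketch below adds Gaussian noise to copies of each training example. The function name `augment_with_noise` and its parameters (`copies`, `sigma`, `seed`) are hypothetical, and zero-mean Gaussian noise is only one possible choice.

```python
import random


def augment_with_noise(inputs, copies=3, sigma=0.1, seed=0):
    """Return the original examples plus `copies` noisy versions of each one."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    augmented = [list(x) for x in inputs]  # keep the clean examples
    for x in inputs:
        for _ in range(copies):
            # Perturb every feature with zero-mean Gaussian noise.
            augmented.append([v + rng.gauss(0.0, sigma) for v in x])
    return augmented


data = [[0.0, 1.0], [2.0, 3.0]]
augmented = augment_with_noise(data, copies=3, sigma=0.1)
```

In a supervised setting each noisy copy would keep the label of the example it came from, which is why the noise level needs to be small enough not to change the correct answer.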

### Early Stopping

When training large models, we often reach a point where the model has learned enough from the training data, and training further leads to overfitting. Training error keeps decreasing steadily over time, but validation set error begins to rise. This means we can obtain a model with better validation set error by returning to the parameter setting at the point where validation set error was lowest. This is what **early stopping** does.
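The idea can be sketched as a training loop that remembers the best parameters seen so far and stops once validation error has not improved for a while. The `patience` parameter and the callables `train_step` and `validation_error` are assumptions made for this sketch, not part of the description above.

```python
import copy


def train_with_early_stopping(params, train_step, validation_error,
                              max_epochs=100, patience=3):
    """Train until validation error stops improving; return the best point."""
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch in range(1, max_epochs + 1):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            # New best: remember these parameters and reset the counter.
            best_params = copy.deepcopy(params)
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error has stopped improving

    return best_params, best_error, best_epoch


# Toy stand-ins: each "epoch" moves the parameter by 1, and validation
# error is lowest at 3, after which further training makes it worse.
best_params, best_error, best_epoch = train_with_early_stopping(
    params=0.0,
    train_step=lambda p: p + 1.0,
    validation_error=lambda p: (p - 3.0) ** 2,
)
```

Here training runs past the best epoch for `patience` more epochs before giving up, but the returned parameters are those from the epoch with the lowest validation error.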

### Sparse Representations

While weight decay places a penalty directly on the model parameters, we can also place a penalty on the activations of the units in a neural network in an attempt to make their activations sparse. This indirectly imposes a complicated penalty on the model parameters.
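A minimal sketch of an activation penalty follows, assuming a single ReLU hidden layer and an L1 penalty on its activations. The L1 choice, the layer shape, and all names here are illustrative assumptions.

```python
def relu(z):
    return max(0.0, z)


def hidden_activations(W, x):
    """Activations h of one hidden layer: h = relu(W x)."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W]


def sparsity_penalty(h):
    """L1 norm of the activations; it is smallest when many units are zero."""
    return sum(abs(a) for a in h)


def regularized_loss(task_loss, h, alpha):
    """Task loss plus alpha times the penalty on the activations."""
    return task_loss + alpha * sparsity_penalty(h)


W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
x = [2.0, 1.0]
h = hidden_activations(W, x)  # the third unit is inactive for this input
loss = regularized_loss(task_loss=0.4, h=h, alpha=0.1)
```

The penalty is computed from `h` rather than from `W`, but because `h` depends on `W`, minimizing it still shapes the weights indirectly.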

## Optimization

Beyond just building neural networks, it's important to know how to optimize them. Optimization covers many things: not only increasing the speed at which the network learns, but also the hardware side, such as making sure the neural networks you're training require less demanding hardware.

### Optimization of Loss Function

When it comes to optimizing the loss function, we need to remember that random search is very costly, so we should instead use gradient descent, applied carefully.

So how do we minimize the loss? Picture the loss plotted against a single parameter: from the current point, we can only really move left or right along the loss graph. We simply want to move in the direction that makes the loss decrease. Repeating this simple step moves us closer to a minimum.
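In one dimension, this principle can be sketched as follows. The quadratic loss, the starting point, and the learning rate are all assumptions chosen to keep the example small; the sign of the derivative tells us whether to move left or right.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient, recording the loss each time."""
    losses = [loss(w)]
    for _ in range(steps):
        # Positive gradient: move left. Negative gradient: move right.
        w = w - learning_rate * grad(w)
        losses.append(loss(w))
    return w, losses


w_final, losses = gradient_descent(w=0.0)
```

Starting to the left of the minimum, the gradient is negative, so every update moves `w` to the right and the recorded loss shrinks at each step. A learning rate that is too large would overshoot the minimum instead, which is one reason gradient descent has to be applied with care.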