# Model Training Patterns

## Stochastic Gradient Descent
- On large datasets, stochastic gradient descent (SGD) is applied to mini-batches of the data rather than to the whole dataset at once
- This is called mini-batch SGD. Extensions of SGD, e.g. Adam and Adagrad, are the de facto optimisers in modern-day machine learning frameworks

- SGD requires training to take place iteratively on small batches, so training happens in a loop
- SGD finds a minimum numerically rather than through a closed-form solution, so we have to detect whether the model has converged
- As a result, the error (called the loss) on the training dataset has to be monitored
- Overfitting can happen if the model complexity is higher than can be afforded by the size and coverage of the dataset
- It is difficult to know whether the complexity is too high until we actually train the model on the dataset
- Therefore, evaluation needs to be done within the training loop, and error metrics also have to be monitored on a withheld split of the training data (the validation set)
- Because the training and validation datasets have both been used in the training loop, it is necessary to withhold yet another split of the training dataset, called the test set
- Metrics are reported on the test set
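
The loop described above can be sketched in plain Python. This is a minimal illustration, not a framework implementation: the toy data (`y = 3x + 2` plus noise), the 70/15/15 split, and the names `make_data`, `mse`, and `train` are all assumptions made for the example.

```python
import random

def make_data(n, seed=0):
    # Hypothetical toy data: y = 3x + 2 plus a little Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + 2 + rng.gauss(0, 0.1)) for x in xs]

def mse(w, b, data):
    # Mean squared error of the linear model w*x + b on a dataset.
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, epochs=50, batch_size=16, lr=0.1, seed=0):
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    n = len(data)
    # Assumed 70/15/15 split into training, validation and test sets.
    train_set = data[: int(0.7 * n)]
    val_set = data[int(0.7 * n): int(0.85 * n)]
    test_set = data[int(0.85 * n):]

    w, b = 0.0, 0.0
    history = []
    for epoch in range(epochs):
        rng.shuffle(train_set)
        # One gradient step per mini-batch.
        for i in range(0, len(train_set), batch_size):
            batch = train_set[i:i + batch_size]
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
        # Monitor training loss and validation error inside the loop.
        history.append((mse(w, b, train_set), mse(w, b, val_set)))
    # The test set is only touched once, after training has finished.
    return w, b, history, mse(w, b, test_set)

w, b, history, test_mse = train(make_data(400))
```

Here `history` holds the per-epoch training and validation errors that would be inspected for convergence or overfitting, while `test_mse` is the figure that would be reported.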

## Design Pattern 11: Useful Overfitting

- Want to intentionally overfit on the training dataset
- Perform training without regularisation, dropout, validation dataset or early stopping

### Problem

- The goal of an ML model is to generalise well, i.e. to make good predictions on unseen data
- If the model overfits then the ability to generalise suffers and so do future predictions

- For example, imagine we have a system that models the physical environment
- The model carries out iterative numerical calculations to compute the precise state of the system
- Suppose all observations have a finite number of possibilities, e.g. temperature is limited to 60 to 80 degrees Celsius in increments of 0.01
- We can then create a training dataset for the ML system consisting of the complete input space and calculate the labels using the physical model
- Splitting the training dataset would be counterproductive because we would then be expecting the model to learn parts of the input space it will not have seen in the training dataset.
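
As a rough sketch of the tabulation step, the snippet below enumerates every temperature from 60 to 80 in steps of 0.01 and labels each one. The function `physical_model` is a hypothetical stand-in for the expensive numerical simulation, and `tabulate_input_space` is a name invented for this example.

```python
def physical_model(temp_c):
    # Hypothetical stand-in for an expensive iterative simulation.
    return 0.5 * temp_c ** 2 - 3.0 * temp_c + 1.0

def tabulate_input_space(lo=60.0, hi=80.0, step=0.01):
    # Index by integer step count to avoid accumulating float error.
    n_steps = round((hi - lo) / step)
    inputs = [round(lo + i * step, 2) for i in range(n_steps + 1)]
    # Every possible input is paired with its label from the physical model.
    return [(t, physical_model(t)) for t in inputs]

dataset = tabulate_input_space()
```

With a single input variable this yields 2001 rows. Because `dataset` already covers every value the model can ever be asked about, holding part of it out would only remove inputs the model needs to learn.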

### Solutions

- In the above scenario, there is no "unseen" data that needs to be generalised to, since all possible inputs have been tabulated
- If all possible inputs to a model can be tabulated there is no such thing as overfitting

- Typically, overfitting to the training dataset in this way causes the model to give misguided predictions on new, unseen datapoints
- The difference here is that we know in advance there won't be unseen data

### Why it Works

- If all possible inputs can be tabulated and trained on, an overfit model will make the same predictions as the "true" model, so overfitting is not a concern

- Overfitting is useful when:
    - There is no noise, so the labels are accurate for all instances
    - You have the complete dataset at your disposal (you have all the examples there are). In this case, overfitting becomes interpolating the dataset
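
A toy illustration of the two conditions, under stated assumptions: `true_model` is a hypothetical noise-free labelling function, the input space is taken to be the integers 0 to 99, and the "model" is the extreme case of overfitting, a pure lookup of the training labels.

```python
def true_model(x):
    # Hypothetical noise-free labelling function.
    return (x * x) % 7

# The entire (finite) input space, with an exact label for every input.
inputs = list(range(100))
training_data = {x: true_model(x) for x in inputs}

def memorising_model(x):
    # A fully "overfit" model: it simply returns the stored training label.
    return training_data[x]

# Compare the memorising model to the true model on every possible input.
max_error = max(abs(memorising_model(x) - true_model(x)) for x in inputs)
```

Since no input exists outside `training_data`, the memorising model agrees with `true_model` everywhere. A real network trained to near-zero training loss is aiming for the same outcome.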

## Design Pattern 12: Checkpoints