# Overfitting

An underfitted model doesn't capture any logic:
- high loss
- low accuracy

A good model captures the underlying logic of the dataset:
- low loss
- high accuracy

An overfitted model captures all the noise and thus "misses the point":
- low loss on the training data
- low accuracy on new data

Bias-variance tradeoff: the balance between underfitting and overfitting.

## Validation

Overfitting is the real enemy when it comes to machine learning.

Usually we can spot overfitting by dividing the available dataset into three subsets: training, validation and test.

The validation dataset helps us detect and prevent overfitting. All the weights and biases are updated based on the training set only. Every once in a while we pause training and apply the model to the validation set. This time we just run the model to see the output, without updating the weights. In other words, we only propagate forwards and calculate the validation loss. While the model is not overfitting, it should be close to the training loss on average.

In the process of creating a good model, you would pause and validate a number of times. 

The two loss functions we use can be referred to as training_loss and validation_loss. 
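The loop described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes a toy linear model `y = w*x + b`, mean squared error as the loss, and synthetic data. The names `mse`, `train_with_validation`, `validate_every` and the learning rate are all assumptions made for the example.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b over (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train_with_validation(train, val, epochs=200, lr=0.5, validate_every=20):
    """Full-batch gradient descent that pauses to validate every few epochs."""
    w, b = 0.0, 0.0
    history = []  # (epoch, training_loss, validation_loss)
    n = len(train)
    for epoch in range(1, epochs + 1):
        # Gradients are computed from the training set only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
        if epoch % validate_every == 0:
            training_loss = mse(w, b, train)
            # Forward pass only: the validation set never updates w or b.
            validation_loss = mse(w, b, val)
            history.append((epoch, training_loss, validation_loss))
    return w, b, history

# Synthetic data (an assumption for this sketch): y = 3x + 1 plus small noise.
random.seed(0)
points = [(i / 100, 3 * (i / 100) + 1 + random.gauss(0, 0.1)) for i in range(100)]
random.shuffle(points)
train, val = points[:80], points[80:]

w, b, history = train_with_validation(train, val)
for epoch, training_loss, validation_loss in history:
    print(f"epoch {epoch:3d}  training_loss={training_loss:.4f}  "
          f"validation_loss={validation_loss:.4f}")
```

A model this simple is unlikely to overfit, so here the two losses should stay close together. The point of the sketch is only the mechanics: weights change in the training step, and the validation step just measures.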

The model is fitted to the training set using gradient descent, so each subsequent training loss will generally be lower than or equal to the previous one. It might eventually get close to zero.

This is where the validation loss comes into play, alongside the training loss. At some point the validation loss could start increasing, which is a red flag for overfitting.

If we are getting better at predicting the training set, but getting worse at predicting the validation set, we are clearly not getting it right at the overall data level.

## Testing

After we have trained the model and validated it, it is time to measure its predictive power.

Logically this means running the model on a dataset it hasn't seen before, which is equivalent to applying it in real life. 

The accuracy we get by forward propagating the test dataset is the accuracy we expect the model to have if we deploy it in real life.
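As a small illustration of that measurement, the sketch below computes accuracy as the fraction of test predictions that match the true labels. The function name `accuracy` and the label values are hypothetical and only serve the example.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that equal the corresponding target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical test-set labels and model outputs, made up for illustration.
test_targets = [1, 0, 1, 1, 0]
test_predictions = [1, 0, 0, 1, 0]

print(accuracy(test_predictions, test_targets))  # 0.8
```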

## Summary

- You get a dataset
- You split it into 3 (for example 80/10/10 or 70/20/10)
    - training (80% or 70%)
    - validation (10% or 20%)
    - testing (10%)
- Train the model using the training dataset only, updating the weights and biases through backpropagation
- Every now and then (usually every epoch), validate the model by running it through the validation dataset. If the training loss and validation loss are decreasing hand in hand, all is OK. If the validation loss is increasing, the model is overfitting.
- Test the model with the test dataset. The accuracy obtained at this stage is the accuracy of the model.
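The split step can be sketched as below, assuming an 80/10/10 split and a shuffle before splitting. The function name `train_val_test_split`, its parameters and the fixed seed are assumptions for the example, not part of any particular library.

```python
import random

def train_val_test_split(data, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the data, then cut it into training, validation and test parts."""
    items = list(data)
    # Shuffling first avoids splits that follow the original ordering of the data.
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    n_val = round(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]  # everything that is left
    return train, val, test

train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Passing `val_fraction=0.2` would give the 70/20/10 variant instead.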

## N-Fold Cross Validation

What if we have a small dataset and can't afford to split it? We would lose underlying relationships, or have so little data for training that the algorithm doesn't work.

N-Fold Cross Validation is a strategy for this situation. 

It combines the training and validation datasets in a clever way. However, it still requires a separate test dataset.

Let's say we have a dataset containing 11,000 observations. We save 1,000 for testing, which leaves 10,000 samples for training and validation.

This is not a big dataset. In data science you usually deal with ginormous datasets.

(Ginormous datasets have their own problems - being so large, they often have a lot of missing values. We refer to them as being sparse)

We want to train on 9,000 datapoints and validate on 1,000. We split the remaining 10,000 observations into 10 chunks of 1,000 observations each and rotate through them 10 times, so this is 10-fold cross-validation. (10 is a commonly used value.)

In the first epoch we treat the first chunk as the validation set and use the others for training. In the second epoch we treat the second chunk as the validation set and use the others for training. And so on.
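The rotation can be sketched as an index generator. This is a minimal sketch assuming contiguous chunks and no shuffling. The name `k_fold_indices` is made up for the example, and each round here corresponds to what the text above calls an epoch.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_indices, val_indices) once per fold."""
    fold_size = n_samples // n_folds
    indices = list(range(n_samples))
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold also takes any remainder left by the integer division.
        stop = start + fold_size if fold < n_folds - 1 else n_samples
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx

# 10,000 observations left after setting the test set aside, 10 folds.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10_000, 10), start=1):
    print(f"fold {fold:2d}: train on {len(train_idx)}, "
          f"validate on {val_idx[0]}..{val_idx[-1]}")
```

Every observation lands in the validation chunk exactly once and in the training chunks the other nine times.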


Pros:
- we utilise more of the data for training
- we still end up with a trained model

Cons:
- we have still trained on the validation data, so overfitting is possible

## Early Stopping

We understand that we should train the model until the loss function is minimised. We can go on doing that forever, but at some point we will overfit.

This is where the validation dataset comes in: it tells us when to stop training.

Generally, early stopping is a technique to prevent overfitting. It is called early stopping as we want to stop early before we overfit.

The most common ways to do this are:
- train for a preset number of epochs (don't use this)
    - pro:
        - eventually solves the problem
    - cons:
        - no guarantee that the min is reached
        - maybe doesn't minimise at all
        - doesn't prevent overfitting
- stop when loss function updates become sufficiently small
    - pros:
        - we are sure the loss is minimised (loss function stops changing)
        - saves computing power (by not iterating uselessly)
    - cons:
        - we can't guess the number of epochs
        - doesn't prevent overfitting
- validation set strategy (stop the algo when the training loss and the validation loss start to diverge)
    - pros:
        - we are sure the validation loss is minimised
        - saves computing power
        - prevents overfitting
    - cons:
        - might iterate uselessly


It might be best to combine the "updates too small" and validation set methods, to balance out their pros and cons.
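One possible way to combine the two criteria is sketched below. It is an illustration under stated assumptions rather than a standard recipe: the names `should_stop` and `stopping_epoch`, the `min_delta` threshold and the `patience` parameter (how many consecutive increases of the validation loss to tolerate) are all introduced for this example, and the loss values are made up.

```python
def should_stop(training_losses, validation_losses, min_delta=1e-4, patience=3):
    """Return a reason string if training should stop, otherwise None."""
    # Criterion 1: validation loss rose on each of the last `patience` checks.
    if len(validation_losses) > patience:
        recent = validation_losses[-(patience + 1):]
        if all(later > earlier for earlier, later in zip(recent, recent[1:])):
            return "validation loss increasing"
    # Criterion 2: the latest training loss update was too small to matter.
    if len(training_losses) >= 2:
        if abs(training_losses[-2] - training_losses[-1]) < min_delta:
            return "loss updates too small"
    return None

def stopping_epoch(training_losses, validation_losses, **kwargs):
    """Replay recorded losses and report the first epoch that triggers a stop."""
    for epoch in range(1, len(training_losses) + 1):
        reason = should_stop(training_losses[:epoch],
                             validation_losses[:epoch], **kwargs)
        if reason is not None:
            return epoch, reason
    return None

# Illustrative loss values only, not measured results.
training_loss = [1.0, 0.6, 0.4, 0.3, 0.25, 0.22, 0.20, 0.19]
validation_loss = [1.1, 0.7, 0.5, 0.45, 0.47, 0.50, 0.55, 0.60]

print(stopping_epoch(training_loss, validation_loss))
# (7, 'validation loss increasing')
```

Waiting for several consecutive increases guards against stopping on a single noisy validation reading, at the cost of a few extra epochs.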

