# Strategies for Machine Learning

There are any number of 'tactics' that could be taken to 'improve' a machine learning model. These might include:
* Collect more data
* Collect a more diverse training set
* Train the algorithm for longer
* Try a different gradient descent algorithm
* Try a bigger network
* Try a smaller network
* Try dropout
* Add $L_2$ regularisation
* Play with the network architecture.

That's plenty of tactics, but how do we decide which to try next?

Thus, the driving motivation of this notebook (and of course, Andrew Ng's course!):
* Analyse your machine learning problem and determine what is to be done next.



## Concept: Orthogonalisation

We want to understand a clear cause and effect when tuning. This is closely related to the concept of orthogonalisation.

If our controls are 'orthogonal', then each is independent of the others. We want a single control to achieve a single action.

### Chain of assumptions in machine learning

When we train a model we:
1. Fit the training set well on the cost function.
2. Fit the dev set well on the cost function.
3. Fit the test set well on the cost function.
4. Hope that the model performs well in the real world.

Each of these has a distinct set of tools that we can apply to it.

### Tools for each step of the process

We've outlined four steps in the above. Let's explore what tools might be appropriate at each step.

1. **The model doesn't function well on the training data.** We might consider:
    * Working with a bigger network
    * Exploring different optimisation methods (tuning our parameters for Adam, for instance.)
2. **The model doesn't function well on the dev data.** We might consider:
    * Explore regularisation (dropout, $L_2$ regularisation, etc.)
    * Get a bigger training set
3. **The model doesn't function well on the test data.** We might consider:
    * Get a bigger dev set
4. **The model doesn't perform well in reality.** We might consider:
    * Change the dev set
    * Change the cost function

#### A note on early stopping

The concern with early stopping is that it affects training and dev performance at the same time, so it's not an 'orthogonal' tool. This is not the end of the world, but we might prefer tools that act on one step at a time.

## Concept: Single number evaluation metric

We generally struggle to find a single number that accurately describes how a model performs in all cases. This gets harder still when we try to explain a model to a stakeholder, taking into account their biases and the like.

All the same, we generally need a quick and reliable way to establish whether a model is 'better' or not.

**Enter the single number evaluation metric**. This lets us quickly compare model A against model B, so as to speed up the iterative process.

### Satisficing and Optimising metrics

Sometimes we may have concepts or concerns that do not define our model performance, per se, but may qualify or disqualify it. 

We could divide our metrics into:
* 'optimising' metrics - metrics we strive to improve and judge upon
* 'satisficing' metrics - metrics that simply qualify whether something is 'good enough' - similar to the concept of 'hygiene' in other contexts

You could phrase this as: "Optimise for metric **X**, subject to requirements of metric **Y**."
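
As a minimal sketch of "optimise for X, subject to Y", the snippet below picks the most accurate model among those that meet a latency limit. The choice of accuracy as the optimising metric and latency as the satisficing metric, along with the `Candidate` class, the `pick_best` function, and all the numbers, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # optimising metric (assumed): higher is better
    latency_ms: float  # satisficing metric (assumed): must stay under a limit


def pick_best(candidates, max_latency_ms):
    """Return the most accurate candidate that meets the latency limit, or None."""
    qualified = [c for c in candidates if c.latency_ms <= max_latency_ms]
    if not qualified:
        return None
    return max(qualified, key=lambda c: c.accuracy)


# Hypothetical models and numbers.
candidates = [
    Candidate("A", accuracy=0.90, latency_ms=80),
    Candidate("B", accuracy=0.92, latency_ms=95),
    Candidate("C", accuracy=0.95, latency_ms=1500),
]

best = pick_best(candidates, max_latency_ms=100)
print(best.name)  # B
```

Model C is the most accurate, but it fails the satisficing requirement, so it is never considered for the optimising comparison.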

### Comparing to human level performance

<img src="static/comparing_to_human_level.png">

As we train our model, we might find there are a few distinct sections.

* Provided all is trucking along quite nicely, performance grows rapidly until we surpass human level performance.
* As we surpass human level performance, the rate of improvement drops off.
* The rate of improvement slows to zero as we approach ["Bayes optimal error"](https://en.wikipedia.org/wiki/Bayes_error_rate) - this is the irreducible error.

#### Optimising for bias or variance

Let's consider two scenarios: one where the human (or perhaps Bayes) error is 1%, and one where the human error is 7.5%.

| Error type | Case 1 | Case 2|
| --- | --- | --- |
| Human | 1% | 7.5% | 
| Training | 8% | 8% |
| Dev | 10% | 10% |

Now, in Case 1, there's a large gap (7 percentage points) between human and training error, compared with a 2-point gap between training and dev error. This suggests that we want to tackle bias first, by improving how well the model fits the training set.

In Case 2, the difference between human and training error is quite small (0.5 points), but there's a larger difference (2 points) between training and dev. We should be focusing on variance.
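
The comparison for the two cases can be sketched in a few lines. This is only an illustration: the function name `bias_variance_focus` is made up, human error is used as a stand-in for Bayes error, and the tie-breaking rule (ties go to variance) is an arbitrary assumption.

```python
def bias_variance_focus(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus), with errors given in percent.

    Assumes human error approximates Bayes error. Ties go to "variance",
    which is an arbitrary choice for this sketch.
    """
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# The two cases from the table above.
case_1 = bias_variance_focus(human_error=1.0, train_error=8.0, dev_error=10.0)
case_2 = bias_variance_focus(human_error=7.5, train_error=8.0, dev_error=10.0)

print(case_1)  # (7.0, 2.0, 'bias')
print(case_2)  # (0.5, 2.0, 'variance')
```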

## Concept: Dataset creation and distributions

How the train, dev, and test sets are created is hugely important for the performance of your model.

### Guidelines

**DO**: Choose a dev set and test set to reflect data **you expect to get in the future** and **consider important to do well on**.

Don't just sample from different pools for the training, dev, and test datasets. It's tempting to segment a dataset based on clear 'fissures' in the data.

The risk is that your training, dev, and test datasets end up targeting completely different goals, and when you move your optimised model to a new dataset, it doesn't work at all!


**DO**: Set your test set to be big enough to give high confidence in the overall performance of your system. 

Split your dataset based on its overall size.

If you're working at the 'small dataset' scale - i.e. perhaps up to 10,000 samples - the old rule of 60/20/20 training/dev/test works well.

If you're in the world of millions of examples, you might consider splitting it down into 98/1/1 for training/dev/test - with one million examples, this would still leave 10,000 samples each for dev and test.
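
To make the arithmetic concrete, here is a small sketch that turns a percentage split into sample counts. The helper `split_sizes` is hypothetical, and it deliberately leaves the choice of split to the caller rather than guessing a cutoff between 'small' and 'large' datasets.

```python
def split_sizes(n_samples, percentages):
    """Return (n_train, n_dev, n_test) for a (train, dev, test) percentage split.

    Integer arithmetic is used so the counts are exact; any remainder from
    rounding down goes to the training set.
    """
    train_pct, dev_pct, test_pct = percentages
    if train_pct + dev_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    n_dev = n_samples * dev_pct // 100
    n_test = n_samples * test_pct // 100
    n_train = n_samples - n_dev - n_test
    return n_train, n_dev, n_test


small = split_sizes(10_000, (60, 20, 20))
large = split_sizes(1_000_000, (98, 1, 1))

print(small)  # (6000, 2000, 2000)
print(large)  # (980000, 10000, 10000)
```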

## Working with mismatched datasets

Let's say that we have a relatively small amount of polished data that we want to reserve mainly for the dev and test datasets. The majority of the training dataset will instead come from some other, easily obtainable bulk source, mixed with a reasonable amount of the polished data.

To diagnose this setup, we can add a new dataset - a "training-dev" set, which is held out from the training data (so it shares the training distribution) but is not used for training. By evaluating the error on each set in turn, we can glean different aspects of performance:

| Error type | Rate (for example) | A large increase over the previous row may be due to... |
| --- | --- | --- |
| Human | 1% |  |
| Training set | 2% | Avoidable bias |
| Training-dev set | 3% | Variance |
| Dev | 4% | Data mismatch |
| Test | 5% | Overfitting to the dev set |
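
The row-by-row reading of the table can be sketched as below. The function `error_gaps`, the `CAUSES` mapping, and the dictionary keys are hypothetical names, and the error rates are made up so that one gap clearly dominates (in the example table every gap is the same 1 point).

```python
# Likely cause of a large jump from the previous row, as in the table above.
CAUSES = {
    "training": "avoidable bias",
    "training-dev": "variance",
    "dev": "data mismatch",
    "test": "overfitting to the dev set",
}


def error_gaps(errors):
    """Map each likely cause to the error increase over the previous row.

    `errors` holds error rates in percent, keyed by
    "human", "training", "training-dev", "dev", and "test".
    """
    order = ["human", "training", "training-dev", "dev", "test"]
    gaps = {}
    for previous, current in zip(order, order[1:]):
        gaps[CAUSES[current]] = errors[current] - errors[previous]
    return gaps


# Made-up error rates for illustration only.
errors = {"human": 1.0, "training": 2.0, "training-dev": 3.0, "dev": 9.0, "test": 9.5}
gaps = error_gaps(errors)
largest = max(gaps, key=gaps.get)

print(gaps)
print("Largest gap:", largest)  # Largest gap: data mismatch
```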

## Concept: Error Analysis Procedure

When we are focusing on how to improve our model, we would do well to:
* Identify where the model struggles;
* Note several paths that could be taken to improve performance;
* Evaluate the potential upper bound on performance improvement if we were to follow that path perfectly.
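
One way to estimate that upper bound, sketched under some assumptions: manually review a sample of misclassified dev examples, tag each with a single error category, and scale the dev error by each category's share. The function `improvement_ceilings`, the category names, and the counts below are all hypothetical, and the sketch assumes each example gets exactly one tag.

```python
from collections import Counter


def improvement_ceilings(error_tags, dev_error):
    """Estimate the most each error category could reduce dev error (in percent).

    `error_tags` holds one category label per reviewed misclassified example.
    Fixing a category perfectly removes at most its share of the dev error.
    """
    counts = Counter(error_tags)
    total = len(error_tags)
    return {tag: dev_error * n / total for tag, n in counts.items()}


# Hypothetical review of 100 misclassified dev examples at 10% dev error.
tags = ["blurry"] * 60 + ["mislabelled"] * 35 + ["other"] * 5
ceilings = improvement_ceilings(tags, dev_error=10.0)

for tag, ceiling in sorted(ceilings.items(), key=lambda item: -item[1]):
    print(f"{tag}: at most {ceiling:.1f} points of dev error")
```

Under these made-up numbers, work on the "other" category could never reduce dev error by more than half a point, whereas fixing blurry images could remove up to 6 points, so that path has the higher ceiling.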