# Structuring Machine Learning Projects

## Introduction to ML Strategy

The challenge with deep learning is that there are many ways to improve a model:

* Gather more data
* Train the algorithm for a longer time
* Change the architecture of the neural network
* Get a more diverse training set

However, pursuing the wrong strategy can waste significant time and resources. You could spend six months gathering more data, only to realize that it barely improves the model.

Therefore, it is important to have a good strategy when it comes to managing and improving machine learning projects.

### Orthogonalization

The most effective ML practitioners have a clear view of which parameters to tune in order to achieve a better result.

Orthogonalization means giving each control one very specific function: a single control should affect only one variable or parameter. That makes it much easier to tune the system toward an expected result. If one control affects several variables at once, or several controls affect the same variable, reaching an optimal result is much harder.

**Show orthogonal graph**

How does that translate to machine learning?

We need to consider the chain of assumptions in machine learning. It is assumed that if the model performs well on the training set, then it will perform well on the dev set, then on the test set, and finally in the real world. Each link in this chain has its own controls to tune when performance falls short at that stage:

* Training set: use a bigger network or change the optimization algorithm
* Dev set: use regularization or a bigger training set
* Test set: use a bigger dev set
* Real world: change the dev set distribution or the cost function
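
One way to picture this is as a lookup from the first failing stage of the chain to the controls listed above. The sketch below is only illustrative: the names `KNOBS` and `first_failing_stage`, and the pass/fail flags, are assumptions made for this example.

```python
# Hypothetical mapping from each stage of the chain to its dedicated controls.
KNOBS = {
    "training": ["bigger network", "change optimization algorithm"],
    "dev": ["regularization", "bigger training set"],
    "test": ["bigger dev set"],
    "real world": ["change dev set distribution", "change cost function"],
}

# Order of the chain of assumptions.
STAGES = ("training", "dev", "test", "real world")


def first_failing_stage(results):
    """Return the first stage whose performance is not acceptable, or None."""
    for stage in STAGES:
        if not results[stage]:
            return stage
    return None


# Made-up situation: the model fits the training set but not the dev set.
results = {"training": True, "dev": False, "test": False, "real world": False}
stage = first_failing_stage(results)
print(stage, KNOBS[stage])
```

Only the first failing stage is reported, because fixing a later stage makes little sense while an earlier one is still failing.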

## Setting up your goal

### Single number evaluation metric

Having a single number evaluation metric allows for quicker assessment of an algorithm.

For example, for a classifier, precision and recall are common evaluation metrics. However, there is a tradeoff between these two metrics. Instead, we should use a metric that combines both.

In this case, we use the F1 score, which represents the harmonic mean of precision and recall. That way, it is much easier to assess the quality of different models, and it speeds up iteration.
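
As a minimal sketch, the F1 score can be computed directly from precision and recall and then used to rank models. The two classifiers and their precision/recall values below are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifiers: name -> (precision, recall)
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
print("best:", best)
```

Classifier B has the higher precision and A the higher recall, so neither metric alone settles the comparison; the single F1 number does.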

### Satisficing and optimizing metrics

Suppose you care about both the accuracy and the running time of a classifier.

Now, you would like to maximize accuracy while keeping the running time small (say less than 100ms). Therefore, the accuracy is the *optimizing* metric and the running time is the *satisficing* metric.

In general, if there are many metrics that you wish to consider, one should be the optimizing metric, and the rest should be satisficing metrics.
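
The selection rule can be sketched as a filter followed by a maximum: discard every model that violates the satisficing threshold, then pick the best remaining model on the optimizing metric. The candidate models and the `pick_model` helper are hypothetical.

```python
# Hypothetical candidates: accuracy is optimizing, running time is satisficing.
models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]


def pick_model(models, max_runtime_ms=100):
    """Return the most accurate model that runs in under max_runtime_ms."""
    feasible = [m for m in models if m["runtime_ms"] < max_runtime_ms]
    if not feasible:
        return None  # no model satisfies the constraint
    return max(feasible, key=lambda m: m["accuracy"])


chosen = pick_model(models)
print(chosen["name"])
```

Model C is the most accurate but is excluded because it misses the running-time threshold; among the remaining models, B wins on accuracy.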

### Set up train/dev/test distributions

The way these sets are set up can either slow a team down or speed up its iterations in the right direction.

As a general guideline, the dev set and test set should reflect the data that you expect to get in the future and consider important to do well on.

For example, a credit modelling algorithm should not be trained on low-income instances if it is to be deployed with medium-income individuals.

In other words, the dev and test sets must come from the same distribution.

### Size of dev and test sets

How large should they be?

The old way of doing it was a 70/30 train/test split or 60/20/20 for train/dev/test.

This is still valid in the case where data is not abundant.

However, in this new era of deep learning, huge amounts of data are available. Therefore, it is often better to do a 98/1/1 train/dev/test split.

The test set should be big enough to give high confidence in the overall performance of the system. This could be 10,000 or 100,000 examples, which would represent less than 10% of the available data.
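
A 98/1/1 split can be sketched with the standard library alone. The function name `split_indices`, the seed, and the dataset size of one million examples are assumptions for illustration. Shuffling before splitting is one simple way to make the dev and test sets come from the same distribution.

```python
import random


def split_indices(n_examples, dev_fraction=0.01, test_fraction=0.01, seed=0):
    """Shuffle example indices and split them into train/dev/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # same shuffle for a given seed

    n_dev = round(n_examples * dev_fraction)
    n_test = round(n_examples * test_fraction)

    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]
    return train, dev, test


# With one million examples, 1% still leaves 10,000 examples per held-out set.
train, dev, test = split_indices(1_000_000)
print(len(train), len(dev), len(test))
```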

## Comparing to human-level performance

### Why human-level performance

In the last few years, we have been comparing machine learning systems to human-level performance. This is possible because the performance of many of these systems is now very good.

Progress is often fast until a system approaches human-level performance. After that, progress slows down as the system approaches the Bayes optimal error, which is the lowest error that any function could achieve. This error is generally not zero: for example, a picture may be too blurry or a sound sample too noisy for any model to label correctly.

Humans are very good at many tasks involving natural data. As long as your algorithm performs worse than humans, you can:

* get labeled data from humans
* perform manual error analysis

### Avoidable bias

Human-level performance is often a good estimate of the Bayes error. If the training error is far from human-level performance, then you should focus on reducing bias, that is, on improving performance on the training set.

However, if the system performs close to human level on the training set, then you should focus on reducing variance, that is, on closing the gap between the training error and the dev set error.

The avoidable bias is the difference between the training error and the human-level error, which stands in for the Bayes error.
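
The decision can be sketched numerically. The version below assumes avoidable bias is measured as training error minus human-level error, and variance as dev error minus training error. The `diagnose` helper and the error rates (in percent) are made up for illustration.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus) from three error rates."""
    avoidable_bias = train_error - human_error  # gap to the Bayes error proxy
    variance = dev_error - train_error          # gap between train and dev
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# Same training and dev errors, two different human-level baselines.
print(diagnose(human_error=1.0, train_error=8.0, dev_error=10.0))
# (7.0, 2.0, 'bias')
print(diagnose(human_error=7.5, train_error=8.0, dev_error=10.0))
# (0.5, 2.0, 'variance')
```

The same 8% training error calls for opposite strategies depending on the human-level baseline, which is why that baseline is worth estimating first.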

### How to improve a model?

Reduce bias:
* Train a bigger model
* Train longer or use a better optimization algorithm (Adam, momentum)
* Change the NN architecture/hyperparameters (e.g. use an RNN or a CNN)

Reduce variance:
* Get more data
* Use regularization (L2, dropout, data augmentation)
* Change the NN architecture/hyperparameters (e.g. use an RNN or a CNN)

## Error Analysis

### Carrying out error analysis

Manually examine a sample of misclassified examples to estimate how much fixing each type of error would improve the model.

For example, if 50 out of 100 misclassified images share the same cause, then fixing that cause could cut your error rate by up to 50%.
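
One possible way to organize such an analysis is to tag each misclassified example with its suspected causes and count the tags. In the sketch below, the tag names, the counts, the 10% dev error, and the `max_error_reduction` helper are all hypothetical.

```python
from collections import Counter


def max_error_reduction(tags_per_example, dev_error):
    """Upper bound on the error (in points) removed by fixing each tag.

    tags_per_example holds one list of tags per misclassified example.
    """
    n = len(tags_per_example)
    counts = Counter(tag for tags in tags_per_example for tag in tags)
    return {tag: dev_error * count / n for tag, count in counts.items()}


# Hypothetical review of 100 misclassified dev examples.
reviewed = [["blurry"]] * 50 + [["mislabeled"]] * 30 + [[]] * 20

ceilings = max_error_reduction(reviewed, dev_error=10.0)  # dev error in percent
print(ceilings)
# {'blurry': 5.0, 'mislabeled': 3.0}
```

Here, fixing every blurry-image error would at best bring a 10% dev error down to 5%, so that category is the most promising one to work on.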

### Cleaning up incorrectly labeled examples

DL algorithms are quite robust to random errors in the training set. Correcting mislabeled examples is not worth the effort if the errors are few and random. However, systematic errors will hurt performance.

When correcting labels:

* Apply the same process to the dev and test sets to make sure they continue to come from the same distribution
* Consider examining examples the algorithm got right as well as the ones it got wrong
* Keep in mind that the training and dev/test data may now come from slightly different distributions

### Build quickly and iterate

Don't build something too simple or too complex. Build a first system quickly, then use error analysis to iterate.

## Mismatched training and dev/test set

### Training and testing on different distributions