# Structuring machine learning projects

## Week 1

### **Introduction to ML strategy**

### Why ML strategy?

While building a model, we might have a large number of ideas that could improve it. However, it's easy to get lost in the chaos or follow false leads, so structuring the ideas is important.

This course aims to provide tools to analyze the ML problem at hand and to find the best solution to solve/improve it.

The course is largely based on Ng's practical experience and covers ideas that university lectures rarely address.

### Orthogonalization

Orthogonalization through examples:  
* An old TV, which has separate knobs for controlling picture width, height, shape etc. Each knob has exactly one function, instead of each of the knobs changing all the properties slightly.  
* A car has a wheel for steering and pedals for controlling speed. So each part of the car has a separate function. Imagine instead that turning the wheel right made the car speed up and steer right, and turning it left made the car slow down and steer left. This would be much harder to drive.

Orthogonalization as a concept basically means that you want a separate "knob" to control each individual feature. You do not want to mix multiple things together behind a single knob, because that becomes very hard to control.

Orthogonalization comes from orthogonal. Think of axes that are orthogonal to each other. You want to move the "slider" along a single axis, not somewhere between multiple.

So basically you want to have a clear toolset to achieve each individual improvement. For machine learning this means that you have a specific set of tricks to try when you want to a) make the model fit the training set better, b) make the model fit the dev set better etc.

For example, early stopping is a method that is less orthogonalized since it simultaneously affects a) and b). This makes it more difficult to use successfully.

<img src="notes_images/ortho.png" width="700">

### **Setting up your goal**

### Single number evaluation metric

When we are trying new model types, we want to quickly see what is the best model, so we can keep iterating and making it better. Therefore, it's handy to choose a single metric, a single real number, to describe how well the model is doing on the dev set.

**The *F1-score* is a standard metric used in literature. It combines (harmonic mean) the metrics *precision* and *recall* into a single number**. Calculating *F1-score* with a good dev set allows us to see with a single glance which model type is the best.

<img src="notes_images/F1_score.png" width="700">

If we have several scores per model (e.g. cat recognition accuracy measured separately in Finland, India, China, Namibia and elsewhere), we can average them to quickly see which model type did best across all the sites.
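
As a minimal sketch of the two ideas above, the snippet below computes the F1-score as the harmonic mean of precision and recall and averages per-site scores into one number. The function names and the precision/recall values are hypothetical and only serve as an illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_score(scores: dict[str, float]) -> float:
    """Plain average of per-site scores, giving one number per model."""
    return sum(scores.values()) / len(scores)


# Hypothetical precision/recall values for two classifiers.
classifiers = {
    "A": {"precision": 0.95, "recall": 0.90},
    "B": {"precision": 0.98, "recall": 0.85},
}
f1_by_model = {
    name: f1_score(m["precision"], m["recall"]) for name, m in classifiers.items()
}
# The single number makes the ranking immediate.
best_model = max(f1_by_model, key=f1_by_model.get)
```

With these made-up numbers neither classifier wins on both precision and recall, which is exactly the situation where a single combined metric makes the choice easy.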

### Satisficing and Optimizing metric

Sometimes it's not possible to include all the information for choosing the model in a single real number - maybe we care about both accuracy (e.g. F1) and running time.

If there are multiple metrics we care about, such as accuracy and running time, we should choose one metric as an *optimizing metric* and (some of) the others as *satisficing metrics*.

**The optimizing metric (e.g. accuracy) is something that we want to be as good as possible within the limits of the satisficing metric(s) (e.g. running time, # of false positives)**. For example, we might want the accuracy to be as good as possible while having a running time of less than 100 ms. We would choose the model that best satisfies these conditions.
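
The selection rule can be sketched as a filter followed by a maximum: first discard every model that violates the satisficing constraint, then pick the best remaining model on the optimizing metric. The candidate models, their numbers and the function name `pick_model` are assumptions made up for this example.

```python
from typing import Optional


def pick_model(models: list[dict], max_runtime_ms: float) -> Optional[dict]:
    """Best accuracy (optimizing) among models under the runtime limit (satisficing)."""
    feasible = [m for m in models if m["runtime_ms"] < max_runtime_ms]
    if not feasible:
        return None  # no model satisfies the constraint
    return max(feasible, key=lambda m: m["accuracy"])


# Hypothetical candidates.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]
chosen = pick_model(candidates, max_runtime_ms=100)
```

Here model C has the highest accuracy but is rejected by the runtime limit, so B is chosen.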

<img src="notes_images/sat&opt.png" width="700">

### Train/dev/test distributions

It's important to properly set up the train set, dev ("development" or "hold out cross validation") set and test set to keep the workflow efficient.

You should always take your dev and test sets from the same distribution. 

So for example, do NOT choose a dev set with cat pictures from the Nordic countries and a test set with cat pictures from Southern Africa. This way you might spend months optimizing the model on the dev set only to find out it works horribly with the test set. Instead, take all the cat pictures and randomly shuffle them into a dev and test set so both sets have Nordic and Southern cat pictures.

In a nutshell: **Choose a dev set and test set (from the same distribution) to reflect data you expect to get in the future and consider important to do well on.**
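
A minimal sketch of the shuffling advice, assuming the examples fit in a list: pool the data from all regions, shuffle once, then cut it into dev and test. The region tags and the helper name `split_dev_test` are hypothetical.

```python
import random


def split_dev_test(examples: list, dev_fraction: float = 0.5, seed: int = 0):
    """Shuffle the pooled examples and cut them into a dev and a test set."""
    rng = random.Random(seed)
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical image identifiers tagged by region.
nordic = [("nordic", i) for i in range(500)]
southern = [("southern_africa", i) for i in range(500)]
dev, test = split_dev_test(nordic + southern)
```

Because the split happens after shuffling the pooled data, both sets contain a mix of both regions.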

We will talk about the training set later. The training set basically defines how well we are able to optimize the model to the dev set.

### Size of the dev and test sets

In the past, a 70-30 % train-test split or a 60-20-20 % train-dev-test split was common. These were reasonable splits back when the data sets were small.

However, deep learning is very data hungry so it's beneficial to have as much data for training as possible.

**Nowadays, if you have 1,000,000 examples, it can be reasonable to do a 98-1-1 % split. This still yields 10,000 examples to both dev and test sets, which might already be enough.**
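
The arithmetic behind these splits can be checked with a small helper. This is only an illustrative sketch; the function name `split_sizes` is made up, and it assumes whole-number percentages.

```python
def split_sizes(
    n_examples: int, train_pct: int, dev_pct: int, test_pct: int
) -> tuple[int, int, int]:
    """Return (train, dev, test) sizes for whole-number percentage splits."""
    if train_pct + dev_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    dev = n_examples * dev_pct // 100
    test = n_examples * test_pct // 100
    # Any rounding remainder goes to the training set.
    train = n_examples - dev - test
    return train, dev, test


small = split_sizes(10_000, 60, 20, 20)    # classic split on a small data set
large = split_sizes(1_000_000, 98, 1, 1)   # split for a large, data-hungry setting
```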

The purpose of the test set is to help us understand how well our model might actually do in a real-world scenario. Consequently, the test set should be big enough to give us a high confidence in the overall performance of the model. This size can be 10,000 or 100,000 or different depending on the application. Sometimes people do not use a test set (only train and dev sets) but that's not really recommended.

### When to change dev/test sets and metrics

Sometimes we might realize during the ML process that we didn't define the dev/test sets or metric properly and need to change them.

The evaluation metric we chose might not work as intended: e.g. the model that scores the highest accuracy might also let through porn images. In this case, we need to revise the metric and artificially give the porn images a higher error. This will involve changing the metric and possibly going through the dev/test sets to label porn images.
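
One way to give certain mistakes a higher error is a weighted misclassification rate, where each example carries a weight and the weights of the misclassified examples are summed and normalized. The sketch below assumes a weight of 10 for the unacceptable images and 1 for everything else; both the weight value and the example data are hypothetical.

```python
def weighted_error(predictions: list[int], labels: list[int], weights: list[float]) -> float:
    """Weighted misclassification rate: sum of weights of wrong predictions / sum of all weights."""
    wrong = sum(w for p, y, w in zip(predictions, labels, weights) if p != y)
    return wrong / sum(weights)


# Hypothetical dev set of four images; the third one is an unacceptable image
# that the classifier wrongly lets through as a cat.
labels = [1, 1, 0, 0]
predictions = [1, 1, 1, 0]
is_unacceptable = [False, False, True, False]

# Assumed weighting: unacceptable images count 10x as much as ordinary ones.
weights = [10.0 if flag else 1.0 for flag in is_unacceptable]

plain = weighted_error(predictions, labels, [1.0] * len(labels))
penalized = weighted_error(predictions, labels, weights)
```

With equal weights the single mistake costs 25 % error, while with the assumed weighting it dominates the metric, so a model making this kind of mistake ranks much lower.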

Another case might be that your dev/test sets contain only perfect images, while the users will also submit blurry ones. In this case, you might need to change your dev/test sets by including more amateurish images and try to do better on them by tweaking your metric.

Here we have an example of orthogonalization, in two separate steps:  
1) Place the target, choose what metric we want to optimize.  
2) Figure out how to hit the target as accurately as possible, e.g. change the cost function to do well on the chosen metric.

**So, try to think of a good metric and good dev/test set when starting out to make the ML process go quicker. But, if you cannot immediately come up with a "perfect" metric, start with something and change it once you realize it's not working.**

### **Comparing to human-level performance**

### Why human-level performance?

Why do we want to compare ML systems to human-level performance:

1) It's actually feasible to reach human-level performance with modern deep learning algorithms.  
2) The workflow is more efficient when we try to solve a problem humans could also do.

**Typically ML model accuracy progresses fast until it surpasses human-level performance. After this it slows down. The lowest error that any model can theoretically achieve is called the *Bayes (optimal) error*; it sets the ceiling on accuracy.**

<img src="notes_images/hlp.png" width="700">

The progress slows down, because:  
1) **The human-level performance is close to Bayes error for many tasks.**  
2) It is easier to train a model up to human-level performance than past it. As long as the ML algorithm is worse than humans, we can:  
  * Get labeled data from humans.
  * Do manual error analysis: why did a person get it right when the algorithm got it wrong?  
  * Get a better analysis of bias/variance (following lectures).

### Avoidable bias

Knowing how well humans do with a certain problem, e.g. classification, we can understand how well we want the ML algorithm to do on the training set. This is based on point 1) above.

For example, **if humans can recognize cat images in the training set with an error of 1 %, an ML algorithm with a training error of 8 % is not great and is likely underfitting (high bias). If the human error is 7.5 %, an ML algorithm with 8 % is doing great and we can focus on reducing overfitting (high variance).** The gap between the human-level error (used as an estimate of Bayes error) and the training error is called the *avoidable bias*. So basically, knowing the human-level error can help us understand if we should focus on bias or variance.
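
The comparison can be written down as two gaps: avoidable bias (training error minus human-level error) and variance (dev error minus training error). The sketch below uses the 1 % and 7.5 % human-level errors and the 8 % training error from the example; the 10 % dev error is an assumption added here so that the variance gap can be computed.

```python
def bias_variance_gaps(human_error: float, train_error: float, dev_error: float):
    """Return (avoidable_bias, variance), in the same units as the inputs."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    return avoidable_bias, variance


def suggest_focus(human_error: float, train_error: float, dev_error: float) -> str:
    """Point at the larger of the two gaps (ties go to variance)."""
    avoidable_bias, variance = bias_variance_gaps(human_error, train_error, dev_error)
    return "bias" if avoidable_bias > variance else "variance"


# Errors in percent; the dev error of 10 % is assumed for illustration.
scenario_1 = suggest_focus(human_error=1.0, train_error=8.0, dev_error=10.0)
scenario_2 = suggest_focus(human_error=7.5, train_error=8.0, dev_error=10.0)
```

The same training and dev errors lead to opposite conclusions depending on the human-level error, which is why the estimate of Bayes error matters.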

<img src="notes_images/abias.png" width="700">

### Understanding human-level performance

**The "human-level error" can have different definitions but often it is useful to define it as the best result humans could possibly obtain. This way, it can be used to approximate Bayes error as accurately as possible.**

### Surpassing human-level performance

Once the ML model surpasses human-level performance, it can be hard to deduce the Bayes error and hence understand if we are dealing with a bias or variance problem. Furthermore, at that point we can no longer use human-labeled data to efficiently teach the model.

Some problems, where ML significantly surpasses human-level performance:  
* Online advertising  
* Product recommendations  
* Logistics (predicting transit time)  
* Loan approvals

All of the above problems rely on structured data (big databases of tables). They are not natural perception problems like computer vision, speech recognition or natural language processing.

**In problems involving large amounts of structured data, ML systems are typically superior to humans. However, it is much harder for ML systems to surpass humans in natural perception problems.** Still, ML systems can surpass humans, and already have, in certain problems of speech recognition, computer vision and medical examination.

### Improving your model performance

To improve the performance of our ML model we should first try to identify if we're dealing with a high-bias or high-variance problem. Then, we should try to solve the problem using the standard solutions:

<img src="notes_images/imp_m.png" width="700">

## Week 2

### Carrying out error analysis

Error analysis means manually investigating the mistakes of our model. It is possible to improve the model this way as long as the model is below human-level performance.

**Typically, during error analysis we manually check how many dev set images were classified wrong and why. Then, we can decide if it is worth it to try and improve the classification.** For example, if 50 out of 100 misclassified images were cat/dog confusions, it's probably worth it to make the classifier better at differentiating between cats and dogs. If only 5 out of 100 were, it's probably not worth it, since fixing all of them would remove at most 5 % of the errors.

It can also be a good idea to evaluate multiple improvement ideas simultaneously while skimming through the incorrectly labelled dev set examples. E.g. checking which images could be fixed by improving detection of dogs/blurry pictures/lions... Then, we could **focus on improving errors on the idea with the highest ceiling (improvement % in detection).**
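
The tally can be kept in a spreadsheet, or sketched in a few lines of code as below. Each misclassified dev example gets a set of tags, and the share of each tag gives the ceiling on how much fixing that category could help. The tag names, the ten tagged examples and the 10 % overall dev error are all hypothetical.

```python
from collections import Counter


def error_category_shares(tagged_errors: list[set[str]]) -> dict[str, float]:
    """Fraction of misclassified examples carrying each tag.

    An example may carry several tags, so the shares can sum to more than 1.
    """
    counts = Counter(tag for tags in tagged_errors for tag in tags)
    return {tag: count / len(tagged_errors) for tag, count in counts.items()}


def max_error_reduction(dev_error: float, share: float) -> float:
    """Upper bound on the dev error removed by fixing one category completely."""
    return dev_error * share


# Hypothetical tags for 10 misclassified dev images.
tagged_errors = [
    {"dog"}, {"blurry"}, {"blurry", "great_cat"}, {"great_cat"}, {"blurry"},
    set(), {"blurry"}, {"great_cat"}, {"blurry"}, {"great_cat"},
]
shares = error_category_shares(tagged_errors)
top_category = max(shares, key=shares.get)
# Assuming an overall dev error of 10 %.
ceiling = max_error_reduction(10.0, shares[top_category])
```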

<img src="notes_images/er_an.png" width="700">

### Cleaning up incorrectly labeled data

It is naturally always good to have correct labels. However, sometimes correcting erroneous labels is not worth the time and effort relative to how much it increases the model performance.

*Training set*

**DL algorithms are quite robust to random errors (incorrectly labelled data) in the training set.** So if the errors are more or less random (random misclicks by labeller etc.), it's usually fine to leave them as they are. Of course they can be fixed, but this will take time and the algorithm might be just fine with a few random errors.

**However, systematic errors in the training set can mess up the algorithms.** So, if almost every white dog is incorrectly labelled as a cat, this will cause problems.

*Dev set*

During manual error analysis, we can check which images were wrongly labelled. Then, we can check the percentage of errors (like previously for dogs/blurs/lions...) and figure out if it's worth the time to correct them or not.

The purpose of the dev set is to help us decide between different model types. If the errors are commonplace enough to hinder this, they should be corrected.

*Correcting dev/test sets*

* **If we decide to correct the dev and test set labels, we should correct them using the same process, so they continue to be of the same distribution.** 
* It is good to remember that mislabelled examples can also hide among the examples the algorithm got right (a prediction that agrees with a wrong label), not just among the ones it got wrong, even though going through the correct predictions could be very time consuming. 
* Correcting just dev/test but not training will cause them to come from slightly different distributions. This is generally ok.

Sometimes researchers make it sound like there is no need for human involvement in DL, just feed the data and let the model do its thing. Some even talk about this in a negative light, as in you should not inspect the errors manually. However, according to Ng, this is silly talk. While tedious, it can save a lot of time in the long run to take a look at the errors and try to understand where the model is going wrong. **So manually inspecting errors and using this knowledge in the model development is totally ok.**

### Build your first system quickly, then iterate

When working on a new ML application, it's a good idea to set up a dev/test set and evaluation metric quickly. Then, build the initial system and start iterating. **Having an initial system (even a simple one) allows us to always choose the next step systematically based on bias/variance & error analysis.**

If the problem is well-known, it can be built with more care based on existing academic literature.

### **Mismatched training and dev/test set**

### Training and testing on different distributions

Since DL requires large amounts of training data, it is becoming increasingly common to add web-scraped and other externally sourced data to the training set. This typically leads to a training and dev/test set that come from different distributions.

**We should keep in mind that the dev/test sets should represent the (target) data we want the model to work on in the future.** So for example, if we want to make a mobile app for recognizing user submitted cat images, our dev/test sets should consist of those images. We can add some user images to the training set, but the training set can also contain a large amount of web-scraped cat images. However, we should not add any web scraped data into dev/test, since our target is the users' amateurish images.

In a nutshell, **it is ok to have different distributions for training and dev/test. Using a lot of data for training will improve the model** compared to trying to train with whatever little data is available from the dev/test distribution.
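
A minimal sketch of this setup, assuming the data fits in lists: the dev and test sets are drawn only from the user images, and the remaining user images join the web-scraped images in the training set. The helper name `build_splits` and all sizes are hypothetical.

```python
import random


def build_splits(web_images: list, user_images: list, n_dev: int, n_test: int, seed: int = 0):
    """Dev/test come only from user images; leftover user images are added to training."""
    if n_dev + n_test > len(user_images):
        raise ValueError("not enough user images for the requested dev/test sizes")
    rng = random.Random(seed)
    users = list(user_images)
    rng.shuffle(users)
    dev = users[:n_dev]
    test = users[n_dev:n_dev + n_test]
    train = list(web_images) + users[n_dev + n_test:]
    rng.shuffle(train)
    return train, dev, test


# Hypothetical data: many web-scraped images, few user-submitted ones.
web = [("web", i) for i in range(200)]
user = [("user", i) for i in range(20)]
train, dev, test = build_splits(web, user, n_dev=5, n_test=5)
```

The training distribution now differs from dev/test, but the dev and test sets still aim at the target the app actually cares about.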

### Bias and Variance with mismatched data distributions

Analyzing bias/variance helps us prioritize what to work on next. But the way we analyze them changes, if our training set comes from a different distribution than the dev/test sets.

**During analysis, we usually compare the training error and the dev error. However, when the distributions do not match, we can no longer draw conclusions from the gap between these two errors alone.** For example, if the training error is 1 % and dev error 10 %, we cannot say for sure if the difference is caused by: 
1) The algorithm not generalizing well to the dev set (variance problem).  
or  
2) The dev set distribution having images that are harder to classify (data mismatch problem).

**To solve this problem we define a *training-dev set*. It has the same distribution as the training set, but it is not used for training the NN.** So the training-dev set is a piece carved out of the training set and set aside during training.

Now, **we can inspect the error on the training set, the training-dev set and the dev set.** For example: 
* if the respective errors are 1 %, 9 % and 10 %, we can conclude that there is a variance problem (overfitting), since the trained NN does not generalize well to unseen data of the same distribution.  
* if the errors are 10 %, 11 % and 12 %, and we know that human-level error is 0 %, we can conclude that there is an avoidable bias problem.  
* if the errors are 1 %, 1.5 % and 10 %, we can conclude that there is a data mismatch problem, since the NN does generalize well to data of the same distribution but it does not do well on the dev set distribution that we care about.

We can also have combinations of the above scenarios such as 10 %, 11 % and 20 %, which would be an avoidable bias + data mismatch problem.

Or, we might have 7 %, 10 % and 6 %, which would be a data mismatch problem where the dev set is easier to classify than the training set.
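
The reasoning in these examples amounts to looking at three gaps between four numbers. The sketch below computes them and flags the large ones. Two things are assumptions made for illustration: a human-level error of 0 % in every scenario (the text states it only for the second one), and the 5-percentage-point threshold for calling a gap a problem.

```python
def error_gaps(human: float, train: float, train_dev: float, dev: float) -> dict[str, float]:
    """Gaps between consecutive error levels, in percentage points."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
    }


def main_problems(gaps: dict[str, float], threshold: float = 5.0) -> list[str]:
    """Names of the gaps at or above the (arbitrarily chosen) threshold."""
    return [name for name, gap in gaps.items() if gap >= threshold]


# Error triples from the text, with human-level error assumed to be 0 %.
variance_case = main_problems(error_gaps(0.0, 1.0, 9.0, 10.0))
bias_case = main_problems(error_gaps(0.0, 10.0, 11.0, 12.0))
mismatch_case = main_problems(error_gaps(0.0, 1.0, 1.5, 10.0))
combined_case = main_problems(error_gaps(0.0, 10.0, 11.0, 20.0))

# A negative mismatch gap means the dev set is easier than the training data.
easier_dev_gap = error_gaps(0.0, 7.0, 10.0, 6.0)["data_mismatch"]
```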

For a more general approach, see the image. The stuff in the red box matters the most, but the other dev set numbers can help too.

<img src="notes_images/bw_mm.png" width="700">

### Addressing data mismatch

We have learned methods for solving bias/variance problems, but the data mismatch problem is a new one. There are no completely systematic approaches for solving this problem, but there are some suggested methods we can try.

**Manual error analysis can help us identify the difference between training and dev set**, e.g. maybe the dev set has more noise than training set. 

**Once we have identified the difference, we can collect more training data similar to dev data or try to make the sets more similar**, e.g. by adding simulated noise to the training data. The latter is called *artificial data synthesis*.

**We can use artificial data synthesis to e.g. combine clear audio with car noise to create "in-car audio".** However, if the duration of our clear audio is much longer than the car noise, we cannot just loop the car noise. Looping the audio might sound fine to the human ear, but a NN will notice the repetitiveness, which might cause overfitting to this specific type of car noise.
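
A toy sketch of the mixing step, using plain lists of samples instead of real audio: the noise track is assembled from randomly chosen clips at random offsets rather than by looping a single clip. This only reduces repetition if the pool holds many distinct noise recordings, which is assumed here; the function names, the gain value and the sample data are hypothetical.

```python
import random


def build_noise_track(noise_clips: list[list[float]], length: int, rng: random.Random) -> list[float]:
    """Concatenate random segments from different (non-empty) noise clips up to `length` samples."""
    track: list[float] = []
    while len(track) < length:
        clip = rng.choice(noise_clips)
        start = rng.randrange(len(clip))  # random offset into the chosen clip
        track.extend(clip[start:])
    return track[:length]


def synthesize_in_car_audio(clean: list[float], noise_clips: list[list[float]],
                            noise_gain: float = 0.3, seed: int = 0) -> list[float]:
    """Add scaled noise to clean audio, sample by sample."""
    rng = random.Random(seed)
    noise = build_noise_track(noise_clips, len(clean), rng)
    return [c + noise_gain * n for c, n in zip(clean, noise)]


# Stand-in signals: silent "speech" and two constant-valued "noise recordings".
clean_audio = [0.0] * 1000
car_noise_pool = [[1.0] * 50, [2.0] * 70]
in_car_audio = synthesize_in_car_audio(clean_audio, car_noise_pool)
```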

<img src="notes_images/adm.png" width="700">

In image recognition, artificial data synthesis can mean e.g. that we use computer generated images of cars. While this works, we should again be careful that we have enough different cars, so that the NN won't overfit to a small subset of all cars.

**Artificial data synthesis does work, we just need to be wary that we are not simulating data from a small subset of all possible cases.** This can cause overfitting.

### **Learning from multiple tasks**

### Transfer learning

**Sometimes we can transfer the knowledge gathered by a neural network from one task to another.** For example, maybe we can take a NN taught to recognize cats and use (some of) that NN's knowledge to help us read X-ray scans. This is a powerful DL trick called *transfer learning*.

In the simplest case, we can just remove the output layer and its weights and replace them with a new output layer with randomly initialized weights. Then, we can use our new data (x, y) to:
* retrain the weights of the output layer or the last few layers, if the new dataset is small.   
* retrain all the weights in the NN, if the new dataset is large.

If we retrain all the weights, we use the term *pre-training* to describe the training done with the old dataset and the term *fine-tuning* to describe the training with the new dataset.
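
The layer surgery can be illustrated with a framework-free toy, where a network is just a list of layers and each layer carries a `trainable` flag. This is not how any particular DL library represents a model; the dictionary layout and the function names are assumptions for this sketch, and for the small-dataset case it retrains only the new output layer.

```python
import random


def make_layer(n_in: int, n_out: int, rng: random.Random) -> dict:
    """A layer as a dict: a small random weight matrix plus a trainable flag."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return {"weights": weights, "trainable": True}


def prepare_for_transfer(pretrained: list[dict], n_new_outputs: int,
                         retrain_all: bool, seed: int = 0) -> list[dict]:
    """Drop the old output layer and attach a freshly initialized one.

    Earlier layers reuse the pre-trained weights; they stay frozen unless
    retrain_all is True (the large-dataset case).
    """
    rng = random.Random(seed)
    n_in = len(pretrained[-1]["weights"][0])  # inputs feeding the old output layer
    kept = [{"weights": layer["weights"], "trainable": retrain_all}
            for layer in pretrained[:-1]]
    return kept + [make_layer(n_in, n_new_outputs, rng)]


# Hypothetical tiny "cat network" with 3 outputs, adapted to a 2-output task.
rng = random.Random(42)
cat_net = [make_layer(4, 8, rng), make_layer(8, 8, rng), make_layer(8, 3, rng)]
xray_net = prepare_for_transfer(cat_net, n_new_outputs=2, retrain_all=False)
```

A training loop would then update only the layers whose `trainable` flag is set.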

**Transfer learning works, because the earlier layers typically recognize low-level features like edges and curves.** This knowledge about "what images generally look like" can then be applied to almost any image-recognition task. Naturally, the same principle also applies to speech recognition and other fields.

<img src="notes_images/tl.png" width="700">

Transfer learning **makes sense, when:**  
* we transfer from a task with a lot of data (1,000,000 examples) to a task with little data (100 examples). The other way around does not make sense.  
* both tasks have the same input x.  
* low level features from the first task could be helpful for learning the second task.

Transfer learning is a very commonly used method in DL.

### Multi-task learning

In transfer learning, we learn one task and sequentially transfer this knowledge to another task. However, in *multi-task learning* **we have a NN learn multiple tasks simultaneously. Each of these learned tasks helps the NN learn the other tasks.**

For example, **an autonomous vehicle needs to learn to simultaneously recognize pedestrians, cars, stop signs and traffic lights.** The corresponding NN would have an output layer of 4 neurons with values {0, 1}, one for recognizing each object. Each image fed to such a network can have multiple labels (car + pedestrian + traffic light). Training such a network is an example of multi-task learning.

In multi-task learning we are building a single NN that is solving multiple tasks at the same time. The alternative would be building a separate NN for each task. However, multi-task learning is much more efficient as long as the earlier layers in the NN are useful for all the learned tasks.

Multi-task learning works even if some of the output labels are missing for some examples.
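
One way to handle missing labels is to sum the per-task logistic loss only over the tasks that are actually labelled for a given example. The sketch below does this for a 4-output network like the one in the driving example; marking a missing label with `None` and clipping the predictions are choices made for this illustration.

```python
import math


def multitask_loss(predictions: list[list[float]], labels: list[list]) -> float:
    """Average over examples of the summed per-task logistic loss.

    A label of None means the task is unlabelled for that example and is skipped.
    """
    eps = 1e-12  # clip predictions away from 0 and 1 to keep log() finite
    total = 0.0
    for y_hat, y in zip(predictions, labels):
        for p, t in zip(y_hat, y):
            if t is None:
                continue
            p = min(max(p, eps), 1 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(predictions)


# One image, outputs for (pedestrian, car, stop sign, traffic light);
# the last two labels are missing.
loss = multitask_loss([[0.5, 0.5, 0.5, 0.5]], [[1, 0, None, None]])
```

Only the two labelled outputs contribute to the loss, so partially labelled images can still be used for training.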

<img src="notes_images/mtl.png" width="700">

Multi-task learning **makes sense, when:**  
* training on a set of tasks that could benefit from having shared lower-level features.  
* the amount of data we have for each task is similar (not a hard rule). For example, having 100 tasks with 1000 examples each will allow all the tasks to essentially have 100,000 examples - the tasks boost each other's performance.
* we are able to train a NN that is big enough to do well on all the tasks.

In practice, setting up multi-task learning can take a lot of effort, since it can be difficult to define so many tasks to train simultaneously on. Consequently, **multi-task learning is used much less often than transfer learning.** Nevertheless, in fields like computer vision multi-task learning can be very useful.

### **End-to-end deep learning**

### What is end-to-end deep learning?

Certain data processing systems have required pipelines with many individual steps. *End-to-end deep learning* **turns a pipeline into a huge neural network that directly maps the inputs to the outputs.**

**When E2E works, it works really well and greatly simplifies the system** since individual hand-designed components are not needed. However, **it does not work for all problems.**

End-to-end DL requires large amounts of data (10,000-100,000 h of audio) to really shine. For small datasets (3000 h), pipelines often work as well or better.

An E2E approach can also be partial - skipping some of the pipeline steps while keeping others.

<img src="notes_images/e2e.png" width="700">

**A pure E2E approach is not always the best.** For example, to recognize people's identities in images it is typically better to split the process into two steps instead of using a single E2E NN. The first step finds the face in the image and the second step compares a zoom-in of the face to a database of faces. This works better than E2E, since there is **a lot of data available for training the individual steps but not as much for the direct E2E approach** ("image -> identity").

A pure E2E approach works better than a pipeline in, for example, translating English to French. There is enough data available to do such an X -> Y task directly without any substeps.

Historically, accepting end-to-end deep learning as a viable option has been difficult for some researchers who have spent most of their scientific careers designing the individual pipeline steps.

### Whether to use end-to-end deep learning

Here we go through the pros and cons of E2E DL to try and obtain a more systematic understanding of whether to use it or not.

**Pros of E2E DL:**  
* Lets the data speak instead of trying to force the NN to think like humans (e.g. how words are supposed to be heard).
* Less hand-designing of components needed.

**Cons of E2E DL:**  
* May need large amounts of data.  
* Excludes potentially useful hand-designed components. 
  * A learning algorithm has two main sources of knowledge: 1) data, 2) hand-designed components. If data is lacking, a well hand-designed algorithm can still work well.

<img src="notes_images/we2e.png" width="700">

For example in autonomous driving, E2E DL is not the most promising method. It is easier to split the problem into multiple subtasks.