# Week 3 Notes

## Carrying Out Error Analysis

A good way to reduce errors is to manually inspect a random sample of the errors that the ML system is making. One may already have initial ideas about the errors from a handful of examples or from prior experience.

However, taking the time to look at a random sample of 100 - 200 errors is a more systematic approach. It provides insight into the categories of errors and suggests how best to reduce the overall error.

A good strategy is to tabulate individual errors, for example in a spreadsheet. Start with a number of predefined labels/categories (maybe from prior experience and/or intuition). As each example is looked at, identify whether it fits one of the preset error labels. If not, consider adding a new label for that type of error. In addition, make side notes that might be useful.

After having labeled all the errors, tally up the counts on each label. At this point, we can understand the maximum improvement by working on a particular category of error.

The most natural suggestion is to focus on reducing/removing the kinds of errors that are most frequent. This may involve additional features, collecting more data, applying particular kinds of regularization, or changing the model family in some way. However, the best choice also depends on what resources are available to reduce each category of error.

There is a ceiling on the improvement that can be achieved by reducing the error in each category - namely the frequency of that category of error.
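
The tally and the per-category ceiling can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the function name `error_ceilings`, the category names, the counts and the 10% overall error rate are all made up for the example, and it assumes each sampled error receives exactly one label.

```python
from collections import Counter


def error_ceilings(error_labels, overall_error_rate):
    """For each category, report its count, its share of the sampled errors,
    and the largest possible drop in overall error from removing it entirely."""
    n_errors = len(error_labels)
    report = {}
    for label, count in Counter(error_labels).most_common():
        share = count / n_errors
        report[label] = {
            "count": count,
            "share_of_errors": share,
            "max_error_reduction": share * overall_error_rate,
        }
    return report


# Hypothetical hand-labeled sample of 100 misclassified dev examples.
sample = ["blurry"] * 45 + ["dog"] * 10 + ["mislabeled"] * 5 + ["other"] * 40
report = error_ceilings(sample, overall_error_rate=0.10)
for label, row in report.items():
    print(label, row)
```

With these assumed numbers, removing every `mislabeled` error could lower the overall error by at most about 0.5 percentage points, while `blurry` has a ceiling of about 4.5 points.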

Using this disciplined procedure, the engineer can decide what is the best way to use resources to reduce errors.


###  Experience Insight

For some types of data this may not work well. For example, with structured data that has thousands of features it is hard to decide exactly what is going on with the errors.

For classification tasks on structured data, it's very difficult to decide which of thousands of features are causing errors: categorizing errors into labels is difficult.

It might be possible to concentrate on just the errors, project the features to a low-dimensional space, and then see whether this provides insight into what might be driving the errors.


## Cleaning Up Incorrectly Labeled Data

This section covers how much resource to devote to fixing incorrectly labeled data, and how to allocate that resource between the training set and the dev/test sets.

### Mislabeled Data in Training Set

#### Data Mislabeled Randomly 

If a small percentage of the training set is randomly mislabeled, this is usually not a problem: given the large number of examples in the era of big data, ML and particularly DL models are robust to a small percentage of random mislabeling. We can ignore such mislabeled data and not invest resources in relabeling it.

####  Data Mislabeled Systematically

If the data is mislabeled systematically with some structure, even if it is a small percentage of the data, then this is a significant problem to which most models are not robust.

### Mislabeled Data in Dev and Test Sets

#### Mislabeled Data: Small Percentage of Performance/Error Metric

If mislabeled data accounts for a small percentage of the error metric, then we should focus on reducing the avoidable bias or variance of the model. Because the ceiling for improvement is relatively low, there is relatively little value in addressing mislabeled examples to begin with.

#### Mislabeled Data: Large Percentage of Performance/Error Metric

If mislabeled data accounts for a large percentage of the error metric, the ceiling for improvement is relatively high and there is real value in fixing mislabeled examples. This will boost the metric substantially and may improve learning somewhat.
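
The "small percentage" versus "large percentage" distinction is simple arithmetic, sketched below. The helper name `mislabel_share` and all the numbers are hypothetical; the point is only that the same absolute amount of label noise matters more as the overall error shrinks.

```python
def mislabel_share(overall_error, error_from_mislabels):
    """Fraction of the overall dev error that is attributable to wrong labels."""
    if not 0 < overall_error <= 1:
        raise ValueError("overall_error must be in (0, 1]")
    if not 0 <= error_from_mislabels <= overall_error:
        raise ValueError("error_from_mislabels must lie in [0, overall_error]")
    return error_from_mislabels / overall_error


# Hypothetical numbers: wrong labels cause 0.6% dev error in both cases,
# but the overall dev error differs.
early = mislabel_share(overall_error=0.10, error_from_mislabels=0.006)
later = mislabel_share(overall_error=0.02, error_from_mislabels=0.006)
print(f"early model: {early:.0%} of dev errors come from labels")
print(f"later model: {later:.0%} of dev errors come from labels")
```

In the first case labels explain roughly 6% of the errors, so bias and variance work is the better investment; in the second they explain roughly 30%, so relabeling becomes worthwhile.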


#### Experience Insight

In a regression setting, there isn't really mislabeling of outputs, although inputs can potentially be mislabeled if some are categorical.

### Train vs Dev and Test Set

It is important that the dev and test sets come from the same distribution to give a fair target of the performance to be improved. Due to this, if mislabeling is going to be fixed, it should be done to both the dev and test sets.

For the train set, there are scenarios where it is acceptable for the train set to have a different distribution from the dev/test set. In particular, if there is so much training data that the model should be robust to some mislabeling in the train set, but mislabeling accounts for a large percentage of the dev/test error metric, then it is fine to fix mislabeling only in the dev/test sets.


## Build First System Quickly - Then Iterate

Often for a problem, we have ideas about the complexity of a data set and problem. We might be tempted to perform all kinds of procedures while not explicitly setting up a machine learning harness.

For example, in a speech recognition task, our prior experience might suggest the following difficulties:

* background noise

* accents

* distance from microphones

* A child's speech pattern (um, ah,...)

One might be tempted to create synthetic examples of background noise, or regularize accents, or apply filters to improve far-from-microphone audio, or check for mislabeled examples. Although these might be good ideas, they aren't addressing the problem directly or systematically.

Working this way tries to optimize many things at once without a framework, and it is not clear how each change affects performance - somewhat like a stochastic search.

Instead, we suggest a build-quickly-first mentality. The reason is that by setting up a clear mindset/code/notes - essentially an ML harness - we can then iterate and improve. So the first cut will have:

* target (possibly balanced)
* features (possibly munged, ingested from several sources, imputed)
* performance metric
* splits of train, dev and test sets

This sets up a clean representation of data and performance metric. Following this, all that is needed is to iterate and improve performance using the strategies described in this course.
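
As a rough sketch of what such a first-cut harness might contain, the snippet below fixes the splits, a single-number metric and a deliberately crude baseline. Everything here is an illustrative assumption: the helper names `split_dataset` and `error_rate`, the 80/10/10 split, the toy data and the majority-class model.

```python
import random
from collections import Counter


def split_dataset(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/dev/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


def error_rate(predict, examples):
    """Single-number metric: fraction of (features, target) pairs predicted wrongly."""
    wrong = sum(1 for features, target in examples if predict(features) != target)
    return wrong / len(examples)


# Toy data: one numeric feature and a binary target (purely illustrative).
data = [((i,), int(i % 3 == 0)) for i in range(1000)]
train, dev, test = split_dataset(data)

# Crude first model: always predict the majority class seen in the train set.
majority_class = Counter(target for _, target in train).most_common(1)[0][0]


def baseline(features):
    return majority_class


print("train/dev/test sizes:", len(train), len(dev), len(test))
print("baseline dev error:", error_rate(baseline, dev))
```

Once this skeleton exists, every later idea can be judged by how it moves the dev metric relative to the baseline.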

## Training And Testing On Different Distributions

Sometimes it is justified to train on data that is not the same as the dev/test set. The major reason is to add more examples for training of data hungry models - because there aren't enough examples for the actual task. In such a case the best strategy is to put all the additional data into the training data. The intuition is that the additional data is similar enough to get into a reasonable space for the parameters. Then the dev set will fine tune the hyperparameters for the actual task and the test set will give an unbiased estimate of the performance on the actual task.

### Data From Related Distribution to Dev/Test

The reason for using related but different data in the training set is the same as for transfer learning. The hope is that the two distributions are very similar at the lower levels of a learning algorithm, so that the model can learn the same low-level features from either.

#### Example 1: HQ Cat Images from Web vs Blurry Phone Camera Uploads

In this example we have many high-quality cat images from the web, versus far fewer blurry, poorly lit examples from user phone cameras. The two should, however, yield similar features at the lower layers of a learning algorithm.

#### Example 2: HQ Speech Audio in Room vs Audio from Car with Background Noise

In this example we have plenty of HQ speech audio recorded in a room and much less data from noisy car recordings. The two should, however, yield similar features at the lower layers of a learning algorithm.

In each case, all the related data (related but not the same as the task distribution) should go in the training set. Some of the scarcer task examples can go in training as well, if there are enough. However, all the dev and test data should come from examples for the actual task.
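
The allocation rule can be sketched as follows. The function name `build_splits`, the dataset sizes and the `("web", i)` / `("phone", i)` tuples are hypothetical stand-ins for real examples.

```python
import random


def build_splits(related, task, n_dev, n_test, seed=0):
    """Put all related data in train; draw dev and test only from task data.

    Any task examples left over after filling dev and test join the train set.
    """
    task = list(task)
    if n_dev + n_test > len(task):
        raise ValueError("not enough task examples for the requested dev/test sizes")
    rng = random.Random(seed)
    rng.shuffle(task)
    dev = task[:n_dev]
    test = task[n_dev:n_dev + n_test]
    train = list(related) + task[n_dev + n_test:]
    rng.shuffle(train)
    return train, dev, test


# Hypothetical sizes: plenty of web images, few phone uploads.
web = [("web", i) for i in range(200_000)]
phone = [("phone", i) for i in range(10_000)]
train, dev, test = build_splits(web, phone, n_dev=2_500, n_test=2_500)
print(len(train), len(dev), len(test))
```

The dev and test sets then measure performance on the distribution that actually matters, at the cost of a train set that no longer matches them.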


## Bias And Variance With Mismatched Data

### Recap on Bias and Variance


### Introducing Training-Dev Set

Because we dev and test on a related but different distribution from that of the training set, we cannot rely on the dev set alone to get a sense of variance. We can still estimate avoidable bias as the training error above the Bayes error metric and then drive this down.

The gap between training error and dev error, however, is not directly useful, because two things change at once when moving from the training set to the dev set: the examples are unseen, and they come from a different distribution.

### Estimating (Training-Dev) Variance on Same Distribution

Since we want to control how metrics change one at a time, we define a new split of the data, which we call the training-dev set. This split is cut from the training set, so it has the same distribution as the training set, but it is held out and not used for training.

As a consequence we can get a sense of the variance of the model by comparing the error metric for the training set vs the training-dev set.

Then the dev error metric can be compared against the training-dev error to understand the error caused by data mismatch - (the error caused because the training set and dev set have different data distributions).

The test set should be used to get an understanding of the error metric to be expected on unseen data; it should not be used for training. Note the following relationships.

Given:

- Bayes error metric

- training error metric

- training-dev error metric

- dev error metric

- test error metric

We can get a sense of the following measures:

- Avoidable bias ~ (training error metric - Bayes error metric)

- Variance ~ (training-dev error metric - training error metric)

- Data Mismatch ~ (dev error metric - training-dev error metric)
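
These gaps are straightforward to compute once the error metrics are available. The sketch below takes variance as the gap between training-dev and training error, in line with the comparison of those two sets described earlier. The function name `diagnose` and all the error values are hypothetical.

```python
def diagnose(bayes, train, train_dev, dev):
    """Decompose error into avoidable bias, variance and data mismatch.

    All arguments are error rates in [0, 1]; each gap is the difference
    between two adjacent levels in the chain bayes -> train -> train_dev -> dev.
    """
    return {
        "avoidable_bias": train - bayes,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
    }


# Hypothetical error rates for a model trained on related data.
gaps = diagnose(bayes=0.005, train=0.010, train_dev=0.015, dev=0.100)
largest = max(gaps, key=gaps.get)
for name, gap in gaps.items():
    print(f"{name}: {gap:.3f}")
print("largest gap:", largest)
```

With these assumed numbers the data mismatch gap dominates, so more regularization or a bigger model would not address the main problem.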

Data mismatch can be reduced by the strategies that follow.

## Addressing Data Mismatch

For now there are no fully systematic strategies to remedy data mismatch. However, two commonly used and well-respected approaches are:

- Manual error analysis to get a sense of what types of errors occur most frequently.

- Adjust the training set data to be more similar to the dev/test set via some artificial data synthesis transforms applied to the related training data. Examples include:

    - Add background noise to clean audio clips to make them more similar to the in-car audio that a smart speaker may be listening to.

    - Use synthetic car images rendered by video game programs to create different views of cars.
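
As an illustration of the audio case, the sketch below mixes a noise clip into a clean clip at a chosen signal-to-noise ratio. It is a minimal example under stated assumptions: clips are plain lists of float samples, the helper names `mean_power` and `mix_at_snr` are invented here, and the sine wave and Gaussian noise merely stand in for real speech and car recordings.

```python
import math
import random


def mean_power(signal):
    """Average squared amplitude of a list of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean, scaling the noise so the result has the given SNR in dB."""
    if len(noise) < len(clean):
        raise ValueError("noise clip must be at least as long as the clean clip")
    noise = noise[:len(clean)]
    target_noise_power = mean_power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


# Stand-ins for real recordings: one second at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Varying `snr_db` and the noise clip from example to example gives a wider range of synthesized conditions than reusing a single setting.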

### Dangers of Synthetic Data

Synthetic data can be a problem if it is not generated from a wide enough space of possibilities. If all synthetic examples are generated from the same small kernel of data (for example, one short noise clip reused everywhere), the model can overfit to this small subset.

### Data Augmentation Strategies

Sometimes we might want to augment data, particularly unstructured data. There are some strategies for this, including applying the following:

- flipping
- rotating
- cropping
- adding Gaussian noise
- scaling
- translating

There are structured data equivalents for all of these transformations, but they do tend to apply more to unstructured data.
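
A few of these transformations are sketched below on a tiny grayscale image stored as a list of rows. This is only meant to show what each operation does; the function names are invented for the example, and a real project would normally use an image library instead.

```python
import random


def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def crop(img, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def translate(img, dx, fill=0):
    """Shift pixels right by dx columns (left if dx is negative), padding with fill."""
    width = len(img[0])
    if dx >= 0:
        return [[fill] * min(dx, width) + row[:max(width - dx, 0)] for row in img]
    return [row[-dx:] + [fill] * min(-dx, width) for row in img]


def add_gaussian_noise(img, sigma, rng):
    """Add independent zero-mean Gaussian noise to every pixel."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in img]


rng = random.Random(0)
image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
augmented = [
    flip_horizontal(image),
    crop(image, 0, 0, 2, 2),
    translate(image, 1),
    add_gaussian_noise(image, 0.1, rng),
]
```

Each augmented copy keeps the original label, which is what makes these transformations a cheap way to enlarge a training set.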

## Transfer Learning

Several variations of transfer learning are used. The premise is that for a deep learning model, it may be possible to reuse the lower-level layers learned from data for task A when training for a different task B. The assumptions when this is useful are:

- tasks A and B take the same shape of feature data
- tasks A and B have similar low level 'features' or embeddings which can be learned from either.
- task A has several orders of magnitude more data than task B.

In general we take the trained weights of task A and do one of the following:

- freeze all the weights apart from the final output layer, then retrain that layer on the data for task B to get the best classifier for the outputs of task B. Effectively the last-but-one layer acts as features for a generalized linear model such as a logistic classifier.

- freeze some of the earlier layer weights and train the rest above - this allows for a non-linear model.

- unfreeze all of the layer weights and rely on 'fine tuning' to a local optimum, using the weights from task A as an initial starting point to optimize the model for task B - remember to replace the output layer of task A with that of task B.

- freeze all the weights from task A, remove the output layer of task A, and add more hidden layers so that task B can learn a non-linear mapping (several hidden layers) from task A's penultimate layer to task B's output.
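
The four variations differ only in which layers stay frozen and which are trained. The framework-free sketch below records that as a list of `(layer_name, trainable)` pairs. The function `transfer_plan`, the strategy names and the layer names are all hypothetical; in a real deep learning framework the same plan would be applied by marking the corresponding parameters as non-trainable.

```python
def transfer_plan(task_a_layers, strategy, n_frozen=0, n_new_hidden=0):
    """Return (layer_name, trainable) pairs for task B.

    task_a_layers lists task A's layers from input to output; its last entry
    (task A's output layer) is always dropped and replaced by a task B output.
    """
    body = list(task_a_layers[:-1])
    if strategy in ("output_only", "new_head"):
        frozen = len(body)          # keep every transferred layer fixed
    elif strategy == "partial":
        if not 0 <= n_frozen <= len(body):
            raise ValueError("n_frozen must be between 0 and the number of body layers")
        frozen = n_frozen           # freeze only the earliest layers
    elif strategy == "fine_tune_all":
        frozen = 0                  # task A weights are just the starting point
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    plan = [(name, index >= frozen) for index, name in enumerate(body)]
    if strategy == "new_head":
        plan += [(f"new_hidden_{k}", True) for k in range(n_new_hidden)]
    plan.append(("task_b_output", True))
    return plan


layers = ["conv1", "conv2", "fc1", "task_a_output"]
for strategy in ("output_only", "partial", "fine_tune_all", "new_head"):
    print(strategy, transfer_plan(layers, strategy, n_frozen=2, n_new_hidden=2))
```

Which plan works best tends to depend on how much task B data is available: the less data, the more layers it is usually safer to keep frozen.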

### Example: Radiological Diagnosis

Take a general image classifier such as VGG-16 and remove the top layer. Then transfer learn to a set of radiological data.

### Example: ULMFiT / word2vec to Amazon Reviews

Train a word2vec model on a Wikipedia dump, then transfer learn on Amazon reviews to predict +ve/-ve sentiment.


## Multitask Learning

In multitask learning we assign to each sample a set of target labels. 

Multilabel classification: This can be thought of as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.


Multioutput regression assigns each sample a set of target values. This can be thought of as predicting several properties for each data-point, such as wind direction and magnitude at a certain location.

The cost function is a sum of various terms, each one a loss function for a label, averaged over all examples for the data set. For a missing label, the loss function will be zero for that particular part. 
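
A minimal sketch of such a cost function for binary labels is shown below. It assumes binary cross-entropy as the per-label loss and uses `None` to mark a missing label; both choices, along with the function name `multitask_loss`, are illustrative assumptions.

```python
import math


def multitask_loss(y_true, y_pred, eps=1e-12):
    """Sum per-label binary cross-entropy, averaged over examples.

    y_true holds 0, 1 or None per label; a None label is skipped, so it
    contributes zero to the loss. y_pred holds predicted probabilities.
    """
    total = 0.0
    for labels, probs in zip(y_true, y_pred):
        for y, p in zip(labels, probs):
            if y is None:
                continue
            p = min(max(p, eps), 1 - eps)  # keep log() away from 0
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Two examples, three labels each; one label is missing in each example.
y_true = [[1, 0, None], [0, None, 1]]
y_pred = [[0.9, 0.2, 0.5], [0.1, 0.7, 0.8]]
loss = multitask_loss(y_true, y_pred)
print(loss)
```

Because missing labels are skipped, partially labeled examples can still be used for the labels they do have.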


## What is End-To-End Deep Learning

End-to-End learning takes inputs and maps to outputs via hidden layers avoiding any feature engineering. This requires several orders of magnitude more data. 

In general, this is fine, because we now have access to big data, and large amounts of data can be used for deep learning with many hidden layers.

If there is little data or significant insight into the problem, then feature engineering may be valid.

The well-known researcher Yann LeCun tweeted:

```
Just heard an interview with the author of "Bullshit Jobs". Was surprised to find out the book isn't about feature engineering for shallow models #torched
```

Nonetheless there may be justification for feature engineering at some times.


## Whether To Use End-To-End Deep Learning

### Pros

- Lets the data speak for itself
- Less hand designing of components needed

### Cons

- May need large amounts of data

- Excludes potentially useful hand designed components (which help when we don't have enough data)

### Example: Autonomous Cars

Take various inputs (lidar, images, sensors) and map them to image classifications (car, pedestrian, road, pavement); routing and steering are then done with motion planning. This is not an end-to-end machine learning approach.

For self-driving cars, end-to-end learning is not currently a promising approach.
