# Machine Learning Strategies

Machine learning strategies are changing in the era of deep learning.

## Orthogonalization
Orthogonalization, in the context of machine learning, refers to the process of separating concerns or objectives in a model, such that each concern can be optimized independently without affecting the others. This concept is derived from the mathematical concept of orthogonality, where two vectors are orthogonal if their dot product is zero, meaning they are independent of each other.

In machine learning, the idea is to design algorithms or models that have distinct, independent components or hyperparameters that control different aspects of the model's behavior. By doing this, you can simplify the process of tuning and optimizing the model, as adjusting one component or hyperparameter will have little or no effect on the other components.

A classic example of orthogonalization in machine learning is the separation of model fitting and regularization. In this context, model fitting is the process of adjusting the model's weights to minimize the error on the training data, while regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. By separating these two concerns, you can independently control the complexity of the model (using regularization) and its ability to fit the training data (using the model fitting process).

In summary, orthogonalization in machine learning is the process of designing models or algorithms with separate, independent components or hyperparameters that control different aspects of the model's behavior. This makes it easier to optimize and tune the model, as each component can be adjusted independently without affecting the others.

## Chain of Assumptions in ML

- Fit training set well on cost function
  - Train a bigger network
  - Switch optimization algorithm
- Fit dev set well on cost function
  - Regularization
  - Bigger training set
- Fit test set well on cost function
  - Bigger dev set (if not working)
- Performs well in real world
  - Change dev set or cost function (if it does not do well)
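The chain above can be read as a lookup: find the first link that fails and turn only the knobs attached to it. The sketch below is one hypothetical way to encode that. The `KNOBS` table and the `next_knobs` helper are illustrative names, not part of any library.

```python
# Hypothetical lookup table: each stage in the chain and the knobs listed for it.
KNOBS = [
    ("training set", ["train a bigger network", "switch optimization algorithm"]),
    ("dev set", ["add regularization", "get a bigger training set"]),
    ("test set", ["get a bigger dev set"]),
    ("real world", ["change the dev set", "change the cost function"]),
]


def next_knobs(fits_train, fits_dev, fits_test, works_in_real_world):
    """Return (stage, knobs) for the first failing stage, or None if all pass."""
    checks = [fits_train, fits_dev, fits_test, works_in_real_world]
    for (stage, knobs), ok in zip(KNOBS, checks):
        if not ok:
            return stage, knobs
    return None


# Example: the model fits the training set but not the dev set.
print(next_knobs(True, False, False, False))
```

Checking the stages in order reflects the idea that a later stage is only meaningful once the earlier ones are satisfied.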

## Precision and Recall
Precision and recall are two important metrics used in machine learning to evaluate the performance of classification models, especially in situations where there is an imbalance between the classes. Both metrics provide insights into the model's ability to correctly identify positive instances, but they focus on different aspects.

**Precision:**
Precision, also known as Positive Predictive Value (PPV), is the ratio of true positive instances (correctly identified positive instances) to the total number of instances predicted as positive by the model (both true positives and false positives). It measures the model's ability to correctly identify only the relevant instances as positive.

Precision = (True Positives) / (True Positives + False Positives)

A high precision indicates that the model is very accurate when it predicts a positive instance, but it does not necessarily mean that the model identifies all the positive instances.

**Recall:**
Recall, also known as Sensitivity or True Positive Rate (TPR), is the ratio of true positive instances (correctly identified positive instances) to the total number of actual positive instances (both true positives and false negatives). It measures the model's ability to identify all the relevant instances as positive.

Recall = (True Positives) / (True Positives + False Negatives)

A high recall indicates that the model is very good at identifying positive instances, but it does not necessarily mean that the model is precise when it predicts a positive instance.

In summary, precision measures the accuracy of the model when it predicts a positive instance, while recall measures the model's ability to identify all the positive instances. Depending on the specific problem and the desired trade-offs, one might choose to optimize for precision, recall, or a combination of both (such as the F1-score, which is the harmonic mean of precision and recall).
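As a minimal sketch of the two formulas above, the function below computes precision and recall from raw counts. The counts in the example are made up, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than a universal convention.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true/false positive and false negative counts."""
    predicted_positive = tp + fp
    actual_positive = tp + fn
    # Assumption: report 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / actual_positive if actual_positive else 0.0
    return precision, recall


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f}")
```

With these counts the model is right 80% of the time when it predicts positive, but it only finds about two thirds of the actual positives.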

## Single number evaluation metric
Tracking several metrics at once, such as precision and recall together, makes it hard to rank models: one model may win on precision while another wins on recall.

To combine precision and recall into one metric for comparing models, you can use the F1 score, which is the harmonic mean of the two values.

This equates to:

F1 = (2) / ((1/precision) + (1/recall))
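A short sketch of how a single number makes model selection easier. The two classifiers and their precision/recall values are hypothetical; `f1_score` here is a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if either value is zero."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# Hypothetical classifiers as (precision, recall) pairs.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}
scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: F1={score:.3f}")
print("best:", best)
```

Classifier B has the higher precision and A the higher recall, so neither wins outright on the two metrics. The F1 score (about 0.924 for A and 0.910 for B) gives a clear ranking.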

**TAKEAWAY**: Use a dev set and a single evaluation metric.

## Satisficing and Optimizing Metric
Let's say that you care about the accuracy and running time of the network. You want to maximize accuracy while ensuring that the running time is less than 100 ms.

Under this scenario, accuracy would be an "optimizing metric", as you want to improve it as much as possible, while running time is a "satisficing metric", as you are only interested in satisfying this requirement.

A good way to approach the optimization and design of a neural network for N metrics is to:
- choose one metric as the optimizing metric which you want to maximise as much as possible
- the remaining N-1 metrics are satisficing meaning you have a target threshold that you must meet, but are not attempting to optimize beyond your threshold.
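The selection rule above can be sketched as a filter followed by a maximum. The candidate models and their numbers are made up for illustration, and `pick_model` is a hypothetical helper.

```python
def pick_model(candidates, max_runtime_ms=100.0):
    """Return the most accurate candidate that meets the runtime threshold."""
    # Satisficing metric: drop every model that misses the threshold.
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None
    # Optimizing metric: among the survivors, take the highest accuracy.
    return max(feasible, key=lambda c: c["accuracy"])


# Hypothetical candidates.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]
chosen = pick_model(candidates)
print(chosen["name"])
```

Model C is the most accurate but fails the 100 ms requirement, so it is never considered. Once a model passes the threshold, being even faster earns it nothing extra.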

### Wake words example

Consider another scenario where you are attempting to listen for "wake words" or "trigger words" such as "Ok Google", "Alexa", etc.

With this model, you care about accuracy (how reliably the device wakes when the wake word is spoken, i.e. few false negatives) and about the number of false positives (the device waking when nobody said the wake word).

One way to approach this is to set accuracy as the optimizing metric while setting false positives as a satisficing metric (say, at most 1 false positive every 24 hours). This makes it much easier to choose the best model and simplifies the development process for deep networks.

## Setting-up Training, Development, and Test Sets
Please note that the Dev set is sometimes called the hold out cross-validation set.

YOU MUST ENSURE THAT THE DATA IN ALL YOUR SETS COMES FROM THE SAME DISTRIBUTION.

## What is the difference between the test, dev, and training sets?
In deep learning and machine learning, a dataset is often split into three parts: the training set, the development (dev) set, and the test set. These sets serve different purposes in the model development process:

Training set: The training set is the largest portion of the dataset, and it is used to train the model. The model learns the patterns and relationships between the input features and the target variable using this data. The training set helps the model learn how to make predictions and adjust its parameters to minimize the error.

Development (dev) set: The dev set, also known as the validation set, is a separate portion of the dataset that is not used for training. It is used to evaluate the performance of the model during the model development process. The dev set helps in fine-tuning the model's hyperparameters, selecting the best model architecture, and preventing overfitting. By using the dev set, you can compare different models or model configurations and choose the one that performs best on the dev set.

Test set: The test set is another separate portion of the dataset that is used to evaluate the final performance of the model after training and hyperparameter tuning. It provides an unbiased estimate of the model's performance on unseen data. The test set should only be used once, at the end of the model development process, to avoid leaking information from the test set into the model during training or hyperparameter tuning.

In summary, the main difference between a dev set and a test set in deep learning is their purpose in the model development process. The dev set is used to evaluate and fine-tune the model during the development process, while the test set is used to assess the final performance of the model on unseen data after the development process is complete.

### How to pick sizes?
Choose a dev and test set to reflect the data you expect to get in the future and consider important to do well on.

## How large should the different sets be?
First off, you should almost never build a network without all three sets.

The old rule of thumb was:
- 70% training, 30% test
- OR
- 60% training, 20% dev, 20% test

In the modern machine learning era, things have changed somewhat. This is largely to do with the huge increase of information and data available for training sets.

If you have 1,000,000 examples, for example:
- 98% train
- 1% dev
- 1% test
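A minimal sketch of the 98/1/1 split using only the standard library. The `split_indices` helper and its parameter names are hypothetical. Shuffling before splitting is an assumption of this sketch that keeps all three sets on the same distribution.

```python
import random


def split_indices(n_examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle example indices and split them into train/dev/test lists."""
    indices = list(range(n_examples))
    # Fixed seed so the split is reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_dev = round(n_examples * dev_frac)
    n_test = round(n_examples * test_frac)
    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]
    return train, dev, test


train, dev, test = split_indices(1_000_000)
print(len(train), len(dev), len(test))
```

With 1,000,000 examples, 1% still leaves 10,000 examples each for the dev and test sets.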

### What about the test set?
The test set needs to be large enough that you can have a high confidence in the overall performance of your system.

## When to change dev/test sets and metrics
Your metric, combined with your dev and test sets, acts as the target you are trying to hit the bullseye on. While developing a network, however, there are circumstances where you need to figuratively move the target.

Let's say that you have two algorithms for showing people images of cats.
- Model A produces a 3% error
- Model B produces a 5% error
- Model A sometimes shows pornographic images to users, but Model B does not show any pornographic images.

Under this circumstance, Model B is the better model according to the users and the company, but your evaluation metric ranks Model A higher because of its lower error.

This is a sign that you should change your evaluation metric. Let's say you are using an error metric similar to this:
- error = 1/mdev * np.sum(indicator_function(ypred[i], y[i]))
- In this case, the indicator_function simply counts up the number of misclassified examples.
- You can update the function to include a weight value w[i], which is 1 if the image is non-pornographic and some high number such as 10 or 100 if the image is pornographic.
- You then also have to update the normalization term from 1/mdev to 1/np.sum(w) so that the error stays between 0 and 1.
- This gives you the updated function below:
  - error = 1/np.sum(w) * np.sum(w[i] * indicator_function(ypred[i], y[i]))
  - Note: you then need to label any pornographic images in your dataset so you can evaluate against this label in addition to cat or no cat.
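A sketch of the weighted error in plain Python. It assumes the error is normalized by the sum of the weights, so the result stays between 0 and 1. The labels, predictions, and the weight of 10 are made up.

```python
def weighted_error(y_pred, y_true, weights):
    """Weighted misclassification rate, normalized by the total weight."""
    total_weight = sum(weights)
    # Add up the weight of every example the model got wrong.
    weighted_mistakes = sum(
        w for p, t, w in zip(y_pred, y_true, weights) if p != t
    )
    return weighted_mistakes / total_weight


# Hypothetical dev set: 1 = cat, 0 = not cat.
# The last image is pornographic, so it gets a weight of 10 instead of 1.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 1, 1]
weights = [1, 1, 1, 10]

print(weighted_error(y_pred, y_true, [1, 1, 1, 1]))  # unweighted error
print(weighted_error(y_pred, y_true, weights))       # weighted error
```

Both runs contain the same two mistakes, but once the pornographic image is weighted the error rises from 0.5 to roughly 0.85, so a model that makes this kind of mistake is penalized much more heavily.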

**TAKEAWAY**: If your error metric is giving you unacceptable results, don't keep working with it. Change it and start over.

- FIRST: choose a metric that is best for the problem
- SECOND: figure out how to optimize the model to do well with the selected metric
- You can think of this as choosing the target in the first step and then aiming and shooting at the target in the second step.

IMPORTANT - If you are doing well on your metric and dev/test set, but this does not correspond to doing well in your application, you should change your metric or update your dev/test set.

## Why compare models to human-level performance
Now, in the era of deep learning, ML models have drastically improved in their capabilities, and comparing their performance to humans has become more reasonable.

Typically, ML models progress fairly rapidly until they reach human-level performance. After they surpass human-level performance, they improve much more slowly.

There is a theoretical limit on model performance called the Bayes optimal error. This is effectively the "best" possible error. E.g. some images are so blurry that no one and nothing can determine whether they show a cat or not. The Bayes optimal error cannot be surpassed; it is a hard limit.

Progress slows down for several reasons. One is that human-level error is typically very close to Bayes error (humans are very good at detecting cats), so there is little room left to improve once a model passes human level.

For tasks that humans are good at, as long as the model is worse than humans there are a few things you can do to improve its performance:
- Get labeled data from humans
- Gain insight from manual error analysis (why did the human get it right?)
- Better analysis of bias/variance

## What is avoidable bias?
You want your model to perform well on the training set, but you often don't want it to perform much better than a human on the training set, as this is an indication of overfitting.

If human error is 1%, training error is 8%, and dev error is 10%, you want to focus on reducing bias: the gap between human and training error (7%) is much larger than the gap between training and dev error (2%).

If instead human error is 7.5%, training error is 8%, and dev error is 10%, the bias of the model is acceptable (only 0.5% above human level), and it would be better to focus on reducing the variance of the model.

For some tasks, such as computer vision (CV), human-level error can serve as a proxy for Bayes error, as humans are extremely good at detecting cats in images.

**TAKEAWAY**: Both of these examples had the same error for the training and dev sets, but the recommended focus changed (bias vs. variance) according to what our Bayes error is. This is because how much of the training error counts as avoidable depends on the Bayes error.

You can think of the difference between Bayes error and your training error as the avoidable bias, and the difference between the training error and dev error as the variance.
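The two scenarios above can be checked with a few lines of arithmetic. This sketch uses human-level error as the stand-in for Bayes error and takes errors as percentages. The `diagnose` helper is a hypothetical name.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus) from three error percentages."""
    # Human-level error is used here as a proxy for Bayes error.
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# Scenario 1: human 1%, training 8%, dev 10%.
print(diagnose(1.0, 8.0, 10.0))
# Scenario 2: human 7.5%, training 8%, dev 10%.
print(diagnose(7.5, 8.0, 10.0))
```

The training and dev errors are identical in both calls. Only the human-level reference changes, and that alone flips the recommendation from bias to variance.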

## Understanding Human-level performance
Let's take medical images as an example. Let's say we are looking at X-rays to provide a diagnosis, and that we have different types of human-level error.
- Typical human - 3% error
- Typical doctor - 1% error
- Experienced doctor - 0.7% error
- Team of experienced doctors - 0.5% error

Under this circumstance, what is the human-level error?
- A good way to think about human-level error is as a proxy for Bayes error.
- If that is the case, then 0.5% is the best approximation of Bayes error, as we know that Bayes error can be no higher than 0.5%.
- If you are publishing a paper or deploying the model in the real world, a reasonable human-level benchmark could instead be a typical doctor at 1%.
- So, the purpose of the model somewhat determines which definition of human-level error to use.

## Surpassing Human-Level Performance
There are relatively few use-cases where machine learning is able to outperform humans. Some of the key use-cases where it can surpass human ability are:
- Online advertising
- Product recommendations
- Logistics
- Loan approvals

All these examples use structured data: databases full of information about individual people. These are not NLP, CV, or "natural perception" tasks. Furthermore, these use cases are able to draw on a HUGE amount of data.

Some natural perception tasks have also achieved human-level performance, including:
- Speech recognition
- Some image recognition
- Some medical applications (ECG, skin cancer, etc.)

## Improving your Model Performance Guidelines
The two fundamental assumptions of supervised learning:
- You can fit the training set well (low avoidable bias)
- The training set performance generalizes pretty well to the dev/test set (low variance)

Reducing (avoidable) bias and variance:
- Look at the difference between human-level error and training error to determine your avoidable bias. If your avoidable bias is greater than the variance, then you want to implement these strategies:
  - Train a bigger model
  - Train the model for longer or with a better optimization algorithm (momentum, RMSprop, Adam, etc.)
  - NN architecture / hyper-parameter search
- Look at the difference between your training error and dev error to determine your variance. If the variance is the larger problem, try:
  - More data
  - Regularization (L2, dropout, data augmentation)
  - NN architecture / hyper-parameter search

## Error Analysis (manually tuning parameters)
Let's say your cat classifier is doing poorly at distinguishing dogs from cats. You discover this by looking at some of the images that your model incorrectly classified. What should you do?

**Error Analysis**
- Get ~100 misclassified dev set examples
- Count how many are dogs
- Work out what percentage of the misclassified images are dogs to determine how much performance can be improved by working on the "dog" problem
- If only 5 are dog images, this optimization will not help very much
- Alternatively, if 50 are dogs, much more can be accomplished with this method
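To make the "how much can this help" estimate concrete, the sketch below computes the best-case dev error if every mistake in one category were fixed. The 10% overall dev error is an assumed figure for illustration (it is not given above), and `error_ceiling` is a hypothetical helper.

```python
def error_ceiling(dev_error, n_in_category, n_reviewed):
    """Best-case dev error (in %) if every mistake in one category were fixed."""
    fraction = n_in_category / n_reviewed
    return dev_error * (1 - fraction)


ASSUMED_DEV_ERROR = 10.0  # percent; an assumption made for this example

# 5 of 100 reviewed mistakes are dogs: error can fall to about 9.5% at best.
print(round(error_ceiling(ASSUMED_DEV_ERROR, 5, 100), 2))
# 50 of 100 reviewed mistakes are dogs: error can fall to about 5% at best.
print(round(error_ceiling(ASSUMED_DEV_ERROR, 50, 100), 2))
```

The ceiling is an upper bound on the benefit. Actual gains will be smaller unless the fix eliminates every mistake in the category.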

### Evaluate multiple ideas in parallel
Let's say you have multiple ways to improve your cat detection routine:
- Fix pictures of dogs being recognized as cats
- Fix great cats (lions, tigers, etc.) being misrecognized
- Improve performance on blurry images

To conduct error analysis create a spreadsheet with your different categories and tally up the mistakes of each category. Then tally up what percentage of images are a part of each category. Using this information, you should be able to determine what type of errors are most common and then focus on optimizing those problems.
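The spreadsheet tally can also be sketched in code. The review notes below are invented. Each entry lists the categories one misclassified image falls into, and an image may fall into more than one.

```python
from collections import Counter

# Hypothetical review notes: one list of category tags per misclassified dev image.
reviewed = [
    ["dog"],
    ["blurry"],
    ["great cat", "blurry"],
    ["dog"],
    [],  # a mistake that fits none of the categories
    ["blurry"],
]


def tally(reviewed):
    """Return each category's share of the reviewed mistakes, most common first."""
    counts = Counter(tag for tags in reviewed for tag in tags)
    total = len(reviewed)
    return {tag: count / total for tag, count in counts.most_common()}


shares = tally(reviewed)
for tag, share in shares.items():
    print(f"{tag}: {share:.0%}")
```

Because one image can carry several tags, the shares may add up to more than 100%. Each share is still the fraction of mistakes that fixing that category could address.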

When conducting this error analysis process, there is a chance that you will discover some examples that are mislabeled.
- In general, DL algorithms are quite robust to random errors in the training set. As long as the mislabeled images are reasonably random, this is typically not a huge issue. To be clear, fixing these labels will result in better performance, but it is likely not worth the time to manually review every example with the aim of fixing errors.
- DL algorithms are MUCH LESS robust when systematic errors are present (consistently labeling a dog as a cat, for instance).
- If you decide to go in and correct mislabeled samples, you should apply the same process to both your dev and test sets to ensure they continue to come from the same distribution.
- Consider examples that the model got right as well as the ones it got wrong (or else you give it an unfair advantage).
- It is less important to correct the labels in the training set. It is MUCH more important that the dev and test sets come from the same distribution.
- Overall, the process of error analysis does not take very much time but can give you valuable insight into what areas to focus on when further developing your model.

## Build your first system quickly, then iterate
To be efficient at this:
- Set up a dev/test set and metric
- Build an initial system quickly and see how it performs
- Then use bias/variance analysis and error analysis to prioritize the next steps.
- It is VERY common for machine learning practitioners to build systems that are too complex for the task at hand. This is more common than teams building systems that are too simple.


## Training and testing on different distributions
In the era of big data and deep learning, it is not uncommon for machine learning practitioners to gather tons of extra data that they add to their training set to increase its size while keeping that data largely out of the dev and test sets. This results in a mismatched distribution between the training set and the other sets.

There are some strategies that can be deployed to combat the possible issues that arise from this mismatch.

## Should you always use all the data you have?
If you are using mismatched distributions of data between your dev/test sets and training set and end up with a training error of 1% and a dev error of 10%, it is hard to know whether this gap is due to the mismatched distribution or whether there actually is a high variance problem. To solve this issue, you can create a training-dev set: a set with the same distribution as the training set but which is not used for training. Let's say you then train the model on the training set (NOT THE TRAINING-DEV SET) and get an error of 1%, then evaluate the model on your training-dev set with an error of 9% and on the dev set with an error of 10%. This tells us that the model does indeed have high variance (around 8%, the gap between training and training-dev error) and that the mismatched distribution is adding an error of only around 1%.

These error values tell you a lot. The gap between each consecutive pair measures a different problem:
- Human-level performance
  - Gap to training set error: avoidable bias
- Training set error
  - Gap to training-dev error: variance
- Training-dev error
  - Gap to dev error: data mismatch
- Dev error
  - Gap to test error: degree of overfitting to the dev set
- Test error
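The ladder of errors can be turned into a small calculation. The training, training-dev, and dev errors below come from the example in this section (1%, 9%, 10%). The human-level error of 0% and the test error of 10% are assumptions added to complete the sketch, and `error_gaps` is a hypothetical helper.

```python
def error_gaps(human, train, train_dev, dev, test):
    """Break a ladder of error percentages into the gap each step represents."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "dev_overfitting": test - dev,
    }


# human=0.0 and test=10.0 are assumed values; the rest follow the example above.
gaps = error_gaps(human=0.0, train=1.0, train_dev=9.0, dev=10.0, test=10.0)
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}%")
```

Under these numbers the largest gap is the variance (8 points), so variance reduction is the first thing to work on, ahead of the 1-point data mismatch.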










